Last Updated: 2023-01-14

URL: https://rstudio.cloud/learn/primers

GitHub: https://github.com/rstudio-education/primers

These interactive tutorials were all created using the learnr package. If you would like to learn how to create your own tutorials, visit the learnr site at https://rstudio.github.io/learnr/.

1 The Basics

Start here to learn the skills that you will rely on in every analysis (and every primer that follows): how to inspect, visualize, subset, and transform your data, as well as how to run code.

If you’re ready to begin, go to the first tutorial. There is no need to install or download anything. Each tutorial has everything you need to write and run R code, right in the tutorial.

1.1 Data Visualization Basics

Start here and begin making plots with R. Plots are one of the most important tools for data science; they are also one of the most fun.

1.1.1 Welcome

Visualization is one of the most important tools for data science.

It is also a great way to start learning R; when you visualize data, you get an immediate payoff that will keep you motivated as you learn. After all, learning a new language can be hard!

This tutorial will teach you how to visualize data with R’s most popular visualization package, ggplot2.

The tutorial focuses on three basic skills:

  1. How to create graphs with a reusable template
  2. How to add variables to a graph with aesthetics
  3. How to make different “types” of graphs with geoms

In this tutorial, we will use the core tidyverse packages, including ggplot2. I’ve already loaded the packages for you, so let’s begin!

These examples are excerpted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

1.1.2 A code template

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Let’s begin with a question to explore.

What do you think: Do cars with bigger engines use more fuel than cars with smaller engines?

  • Cars with bigger engines use more fuel. ✓
  • Cars with bigger engines use less fuel. ✓
Great! 

In other words, you expect a negative relationship between engine size and fuel efficiency: the bigger the engine, the fewer miles per gallon. Now let's test your hypothesis against data.

1.1.2.1 mpg

You can test your hypothesis with the mpg dataset that comes in the ggplot2 package. mpg contains observations collected on 38 models of cars by the US Environmental Protection Agency.

To see the mpg data frame, type mpg in the code block below and click “Submit Answer”.

# mpg is a dataset in the ggplot2 package, so we need to load ggplot2, which is part of the tidyverse.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows
"Good job! We'll use interactive code chunks like this throughout these tutorials. 
Whenever you encounter one, you can click Submit Answer to run (or re-run) the code in the chunk. 
If there is a Solution button, you can click it to see the answer."

You can use the black triangle that appears at the top right of the table to scroll through all of the columns in mpg.

Among the variables in mpg are:

  1. displ, a car’s engine size, in liters.
  2. hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low mpg consumes more fuel than a car with a high mpg when they travel the same distance.
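
To focus on these two variables, one option is to pull them out of mpg and summarize them. The sketch below assumes the dplyr package (part of the tidyverse) is available; select() and summary() are not introduced in this tutorial, so treat this as a preview rather than a required step. The object name engine_vs_mileage is made up for the example.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)    # provides select()

# Keep only the engine size and highway mileage columns
engine_vs_mileage <- select(mpg, displ, hwy)
engine_vs_mileage

# Quick numeric summaries of each column
summary(engine_vs_mileage)
```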

Now let’s use this data to make our first graph.

1.1.2.2 A plot

The code below uses functions from the ggplot2 package to plot the relationship between displ and hwy.

To see the plot, click “Run Code.”

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Can you spot the relationship?

1.1.2.3 And the answer is…

The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). Points that have a large value of displ have a small value of hwy and vice versa.

In other words, cars with big engines use more fuel. If that was your hypothesis, you were right!
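
As a numeric cross-check (not part of the original tutorial), the correlation coefficient summarizes the direction of a linear relationship in a single number. A minimal sketch using base R's cor(); a value below zero agrees with the downward trend in the plot.

```r
library(ggplot2)  # provides the mpg dataset

# Correlation between engine size and highway fuel efficiency
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
displ_hwy_cor

# TRUE when the relationship is negative
displ_hwy_cor < 0
```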

Now let’s look at how we made the plot.

1.1.2.4 ggplot()

Here’s the code that we used to make the plot. Notice that it contains three functions: ggplot(), geom_point(), and aes().

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

In R, a function is a name followed by a set of parentheses. Many functions require special information to do their jobs, and you write this information between the parentheses.
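
For instance, base R's round() function needs to know which number to round and how many digits to keep. A small illustration (round() is not used elsewhere in this tutorial, and rounded_root is a made-up name):

```r
# A function call: the name, then the information it needs in parentheses
round(3.14159, digits = 2)

# Functions can be nested; the inner result is passed to the outer function
rounded_root <- round(sqrt(2), digits = 3)
rounded_root
```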

1.1.2.5 ggplot

The first function, ggplot(), creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph.

By itself, ggplot(data = mpg) creates an empty graph, which looks like this.

ggplot(data = mpg)

1.1.2.6 geom_point()

geom_point() adds a layer of points to the empty plot created by ggplot(). This gives us a scatterplot.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

mapping = aes()

geom_point() takes a mapping argument that defines which variables in your dataset are mapped to which axes in your graph. The mapping argument is always paired with the function aes(), which you use to gather together all of the mappings that you want to create.

Here, we want to map the displ variable to the x axis and the hwy variable to the y axis, so we add x = displ and y = hwy inside of aes() (and we separate them with a comma).

Where will ggplot2 look for these mapped variables? In the data frame that we passed to the data argument, in this case, mpg.
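
To see that aes() is just a container for mappings, you can call it on its own and inspect the result. This sketch is an illustration only; the object name point_mapping is made up for the example.

```r
library(ggplot2)

# aes() gathers the mappings; it does not look at any data yet
point_mapping <- aes(x = displ, y = hwy)
point_mapping

# The names are the aesthetics that have been mapped
names(point_mapping)

# The same mapping object can be handed to geom_point()
ggplot(data = mpg) +
  geom_point(mapping = point_mapping)
```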

1.1.2.7 A graphing workflow

Our code follows the common workflow for making graphs with ggplot2. To make a graph, you:

  1. Start the graph with ggplot()
  2. Add elements to the graph with a geom_ function
  3. Select variables with the mapping = aes() argument
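
The three steps can also be run one at a time. The sketch below stores the empty graph in an object and then adds a layer to it. The names empty_plot and scatterplot are hypothetical, and objects themselves are covered in the Programming Basics tutorial.

```r
library(ggplot2)

# Step 1: start the graph
empty_plot <- ggplot(data = mpg)

# Steps 2 and 3: add a geom and select variables with mapping = aes()
scatterplot <- empty_plot +
  geom_point(mapping = aes(x = displ, y = hwy))

# Printing a plot object draws it
scatterplot
```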

1.1.2.8 A graphing template

In fact, you can turn our code into a reusable template for making graphs. To make a graph, replace the bracketed sections in the code below with a data set, a geom_ function, or a collection of mappings.

Give it a try! Replace the bracketed sections with mpg, geom_boxplot, and x = class, y = hwy to make a slightly different graph. Be sure to delete the # symbols before you run the code.

# ggplot(data = <DATA>) + 
#  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = class, y = hwy))

"Good job! This plot uses boxplots to compare the fuel efficiencies of different types of cars. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You'll learn more about boxplots and other geoms in the tutorials that follow."

1.1.2.9 Common problems

As you start to run R code, you’re likely to run into problems. Don’t worry — it happens to everyone. I have been writing R code for years, and every day I still write code that doesn’t work!

Start by carefully comparing the code that you’re running to the code in the examples. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every " is paired with another ". Also pay attention to capitalization; R is case sensitive.

1.1.2.10 + location

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of a line, not the start. In other words, make sure you haven’t accidentally written code like this:

ggplot(data = mpg) 
+ geom_point(mapping = aes(x = displ, y = hwy))
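
In the code above, R treats the first line as a complete command and draws an empty plot, then typically reports an error on the second line because it starts with +. A corrected sketch, with the + moved to the end of the first line (corrected_plot is a made-up name):

```r
library(ggplot2)

# The + ends the first line, so R knows the command continues
corrected_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

corrected_plot
```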

1.1.2.11 help

If you’re still stuck, try the help. You can get help about any R function by running ?function_name in a code chunk, e.g. ?geom_point. Don’t worry if the help doesn’t seem that helpful — instead skip down to the bottom of the help page and look for a code example that matches what you’re trying to do.

If that doesn’t help, carefully read the error message that appears when you run your (non-working) code. Sometimes the answer will be buried there! But when you’re new to R, you might not yet know how to understand the error message. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.

1.1.2.12 Exercise 1

Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

"Good job! A ggplot that has no layers looks blank. To finish the graph, add a geom function."

1.1.2.13 Exercise 2

Make a scatterplot of cty vs hwy.

ggplot(data = mpg) +
  geom_point(aes(x = cty, y = hwy))

"Excellent work!"

1.1.2.14 Exercise 3

What happens if you make a scatterplot of class vs drv? Try it. Why is the plot not useful?

ggplot(data = mpg) +
  geom_point(aes(x = class, y = drv))

"Nice job! `class` and `drv` are both categorical variables. As a result, points can only appear at certain values, where many points overlap each other. You have no idea how many points fall on top of each other at each location. Experiment with geom_count() to find a better solution."

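One possible solution to Exercise 3 is sketched below. geom_count() draws a point at each location and sizes it by the number of observations there. The count() function from dplyr is an assumption here (it is not used in this tutorial); it gives the same tallies as a table.

```r
library(ggplot2)
library(dplyr)

# Size each point by how many cars share that class/drv combination
ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))

# The same information as a table: one row per observed combination
class_drv_counts <- count(mpg, class, drv)
class_drv_counts
```
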
1.1.3 Aesthetic mappings

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

1.1.3.1 A closer look

In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend between engine size and gas mileage. These cars have a higher mileage than you might expect. How can you explain these cars?

image

1.1.3.2 A hypothesis

Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular). To check this, we need to add the class variable to the plot.

1.1.3.3 Aesthetics

You can add a third variable, like class, to a two dimensional scatterplot by mapping it to a new aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.

You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. Here we change the levels of a point’s size, shape, and color to make the point small, triangular, or blue:

image

1.1.3.4 A strategy

We can add the class variable to the plot by mapping the levels of an aesthetic (like color) to the values of class. For example, we can color a point green if it belongs to the compact class, blue if it belongs to the midsize class, and so on.

Let’s give this a try. Fill in the blank piece of code below with color = class. What happens? Delete the commenting symbols (#) before running your code. (If you prefer British English, you can use colour instead of color.)

# ggplot(data = mpg) + 
#   geom_point(mapping = aes(x = displ, y = hwy, ____________))
ggplot(data = mpg) + 
   geom_point(mapping = aes(x = displ, y = hwy, color = class))

1.1.3.5 And the answer is…

The colors reveal that many of the unusual points in mpg are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.
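
To confirm which cars these are, you can pull the two-seaters out of the data. This sketch uses filter() and select() from dplyr, which are not covered in this tutorial; the object name two_seaters is made up for the example.

```r
library(ggplot2)
library(dplyr)

# Keep only the rows whose class is "2seater"
two_seaters <- filter(mpg, class == "2seater")

# Show the columns that matter for this question
select(two_seaters, manufacturer, model, displ, hwy, class)
```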

This isn’t the only insight we’ve gleaned; you’ve also learned how to add new aesthetics to your graph. Let’s review the process.

1.1.3.6 Aesthetic mappings

To map an aesthetic to a variable, set the name of the aesthetic equal to the name of the variable, and do this inside mapping = aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. ggplot2 will also add a legend that explains which levels correspond to which values.

This insight gives us a new way to think about the mapping argument. Mappings tell ggplot2 more than which variables to put on which axes; they tell ggplot2 which variables to map to which visual properties. The x and y locations of each point are just two of the many visual properties displayed by a point.
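
You can check how many levels ggplot2 needs by listing the unique values of the mapped variable. A quick sketch with base R (car_classes is a made-up name):

```r
library(ggplot2)  # provides the mpg dataset

# Each unique value of class receives its own color and legend entry
car_classes <- unique(mpg$class)
car_classes

# Number of colors ggplot2 has to assign
length(car_classes)
```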

1.1.3.7 Other aesthetics

In the above example, we mapped color to class, but we could have mapped size to class in the same way.

Change the code below to map size to class. What happens?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.

"Great Job! Now the size of a point represents its class. Did you notice the warning message? ggplot2 gives us a warning here because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea."

1.1.3.8 alpha

You can also map class to the alpha aesthetic, which controls the transparency of the points. Try it below.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.

"Great Job! If you look closely, you can spot something subtle: many locations contain multiple points stacked on top of each other (alpha is additive so multiple transparent points will appear opaque)."

1.1.3.9 Shape

Let’s try one more aesthetic. This time map the class of the points to shape, then look for the SUVs. What happened?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (`geom_point()`).

"Good work! What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic. So only use it when you have fewer than seven groups."

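The warning suggests specifying shapes manually. One way to do that, sketched here as an assumption about what you might want, is scale_shape_manual(), which is not covered in this tutorial. The shape codes 0 to 6 are an arbitrary choice of seven distinct symbols.

```r
library(ggplot2)

# Supply seven shapes, one for each class, so that no group is dropped
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

# The number of SUVs matches the 62 rows removed by the default palette
sum(mpg$class == "suv")
```
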
1.1.3.10 Exercise 1

In the code below, map cty, which is a continuous variable, to color, size, and shape. How do these aesthetics behave differently for continuous variables, like cty, vs. categorical variables, like class?

# Map cty to color
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

# Map cty to size
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

# Map cty to shape
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
# Map cty to color
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))

# Map cty to size
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

# Map cty to shape
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
# The call above fails with an error: a continuous variable can not be mapped to shape.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (`geom_point()`).

"Very nice! ggplot2 treats continuous and categorical variables differently. Notably, ggplot2 supplies a blue gradient when you map a continuous variable to color, and ggplot2 will not map continuous variables to shape."

1.1.3.11 Exercise 2

Map class to color, size, and shape all in the same plot. Does it work?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class, size = class, shape = class))
## Warning: Using size for a discrete variable is not advised.
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (`geom_point()`).

"Very nice! ggplot2 can map the same variable to multiple aesthetics."

1.1.3.12 Exercise 3

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Try it.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

"Good job! ggplot2 will map the aesthetic to the results of the expression. Here, ggplot2 mapped the color of each point to TRUE or FALSE based on whether the point's `displ` value was less than five."

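You can preview what ggplot2 will map in Exercise 3 by evaluating the expression yourself. A short sketch with base R (small_engine is a made-up name):

```r
library(ggplot2)  # provides the mpg dataset

# The expression returns one TRUE or FALSE per car
small_engine <- mpg$displ < 5
head(small_engine)

# How many cars fall on each side of the cutoff
table(small_engine)
```
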
1.1.3.13 Setting aesthetics

What if you just want to make all of the points in your plot blue, like in the plot below?

image

You can do this by setting the color aesthetic outside of the aes() function, like this:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

1.1.3.14 Setting vs. Mapping

Setting works for every aesthetic in ggplot2. If you want to manually set the aesthetic to a value in the visual space, set the aesthetic outside of aes().

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue", shape = 3, alpha = 0.5)

If you want to map the aesthetic to a variable in the data space, map the aesthetic inside aes().

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = fl, alpha = displ))

1.1.3.15 Exercise 4

What do you think went wrong in the code below? Fix the code so it does something sensible.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

"Good job! Putting an aesthetic in the wrong location is one of the most common graphing errors. Sometimes it helps to think of legends. If you will need a legend to understand what the color/shape/etc. means, then you should probably put the aesthetic inside `aes()` --- ggplot2 will build a legend for every aesthetic mapped here. If the aesthetic has no meaning and is just... well, aesthetic, then set it outside of `aes()`."

1.1.3.16 Recap

For each aesthetic, you associate the name of the aesthetic with a variable to display, and you do this within aes().

Once you map a variable to an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.

You’ve experimented with the most common aesthetics for points: x, y, color, size, alpha and shape. Each geom uses its own set of aesthetics (you wouldn’t expect a line to have a shape, for example). To find out which aesthetics a geom uses, open its help page, e.g. ?geom_line.

This raises a new question that we’ve only brushed over: what is a geom?

1.1.4 Geometric objects

1.1.4.1 Geoms

How are these two plots similar?

image1 image1

Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.

A geom is the geometrical object that a plot uses to represent observations. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.

As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.

1.1.4.2 Geom functions

To change the geom in your plot, change the geom function that you add to ggplot(). For example, take this code which makes the plot on the left (above), and change geom_point() to geom_smooth(). What do you get?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

"Good job! You get the plot on the right (above)."

1.1.4.3 More about geoms

ggplot2 provides over 30 geom functions that you can use to make plots, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). You’ll learn how to use these geoms to explore data in the Visualize Data primer.

Until then, the best way to get a comprehensive overview of the available geoms is with the ggplot2 cheatsheet. To learn more about any single geom, look at its help page, e.g. ?geom_smooth.

1.1.4.4 Exercise 1

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
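
One possible set of answers, as a sketch: ggplot2 provides geom_line(), geom_boxplot(), geom_histogram(), and geom_area(). The variable choices and the binwidth below are only for illustration.

```r
library(ggplot2)

# Line chart
ggplot(data = mpg) +
  geom_line(mapping = aes(x = displ, y = hwy))

# Boxplot
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

# Histogram (needs only an x variable)
ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)

# Area chart
ggplot(data = mpg) +
  geom_area(mapping = aes(x = displ, y = hwy))
```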

1.1.4.5 Exercise 2

What does the se argument to geom_smooth() do?

  • Nothing. se is not an argument of geom_smooth() ✗
  • chooses a method for calculating the smooth line ✗
  • controls whether or not to show errors ✗
  • Adds or removes a standard error ribbon around the smooth line ✓
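
To see the effect of se for yourself, compare the two plots in this sketch. The first uses the default (se = TRUE); the second turns the ribbon off.

```r
library(ggplot2)

# Default: se = TRUE draws a shaded ribbon around the line
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

# se = FALSE removes the ribbon and leaves only the line
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy), se = FALSE)
```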

1.1.4.6 Putting it all together

The ideas that you’ve learned here (geoms, aesthetics, and the implied existence of a data space and a visual space) combine to form a system known as the Grammar of Graphics.

The Grammar of Graphics provides a systematic way to build any graph, and it underlies the ggplot2 package. In fact, the first two letters of ggplot2 stand for “Grammar of Graphics”.

1.1.4.7 The Grammar of Graphics

The best way to understand the Grammar of Graphics is to see it explained in action:

Video: https://vimeo.com/223812632

1.1.5 The ggplot2 package

What is a package?

Throughout this tutorial, I’ve referred to ggplot2 as a package. What does that mean?

The R language is subdivided into packages, small collections of data sets and functions that all focus on a single task. The functions that we used in this tutorial come from one of those packages, the ggplot2 package, which focuses on visualizing data.

1.1.5.1 What should you know about packages?

When you first install R, you get a small collection of core packages known as base R. The remaining packages—there are over 10,000 of them—are optional. You don’t need to install them unless you want to use them.

ggplot2 is one of these optional packages, and so are the other packages that we will look at in these tutorials. Some of the most popular and most modern parts of R come in the optional packages.

You don’t need to worry about installing packages in these tutorials. Each tutorial comes with all of the packages that you need pre-installed; this is how we make the tutorials easy to use.

However, one day, you may want to use R outside of these tutorials. When that day comes, you’ll want to remember which packages to download to acquire the functions you use here. Throughout the tutorials, I will try to make it as clear as possible where each function comes from!

If you’d like to learn more about installing R packages (or R or the RStudio IDE), the Set Up video tutorial walks you through the process of setting up R on your own computer.
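
When you do work on your own computer, the usual pattern looks something like the sketch below. The install.packages() line is commented out because it downloads from the internet and only needs to be run once per machine.

```r
# Install once (uncomment to run; requires an internet connection)
# install.packages("tidyverse")

# Load the package in every new R session before using its functions
library(ggplot2)

# The package::name form says explicitly where an object comes from
head(ggplot2::mpg)
```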

1.1.5.2 Where to from here

Congratulations! You can use the ggplot2 code template to plot any dataset in many different ways. As you begin exploring data, you should incorporate these tools into your workflow.

There is much more to ggplot2 and Data Visualization than we have covered here. If you would like to learn more about visualizing data with ggplot2, check out RStudio’s primer on Data Visualization.

Your new data visualization skills will make it easier to learn other parts of R, because you can now visualize the results of any change that you make to data. You’ll put these skills to immediate use in the next tutorial, which will show you how to extract values from datasets, as well as how to compute new variables and summary statistics from your data. See you there.

1.2 Programming Basics

This tutorial demystifies programming with R. Here, you’ll learn how to run functions and build objects.

1.2.1 Welcome

1.2.1.1 Welcome to R

R is easiest to use when you know how the R language works. This tutorial will teach you the implicit background knowledge that informs every piece of R code. You’ll learn about:

  • functions and their arguments
  • objects
  • R’s basic data types
  • R’s basic data structures including vectors and lists
  • R’s package system

1.2.2 Functions

1.2.2.1 Functions

Video: https://vimeo.com/220490105

1.2.2.2 Run a function

Can you use the sqrt() function in the chunk below to compute the square root of 962?

sqrt(962)
## [1] 31.01612

1.2.2.3 Code

Use the code chunk below to examine the code that sqrt() runs.

sqrt
## function (x)  .Primitive("sqrt")
"Good job! sqrt immediately triggers a low level algorithm optimized for performance, so there is not much code to see."

1.2.2.4 lm

Compare the code in sqrt() to the code in another R function, lm(). Examine lm()’s code body in the chunk below.

lm
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## {
##     ret.x <- x
##     ret.y <- y
##     cl <- match.call()
##     mf <- match.call(expand.dots = FALSE)
##     m <- match(c("formula", "data", "subset", "weights", "na.action", 
##         "offset"), names(mf), 0L)
##     mf <- mf[c(1L, m)]
##     mf$drop.unused.levels <- TRUE
##     mf[[1L]] <- quote(stats::model.frame)
##     mf <- eval(mf, parent.frame())
##     if (method == "model.frame") 
##         return(mf)
##     else if (method != "qr") 
##         warning(gettextf("method = '%s' is not supported. Using 'qr'", 
##             method), domain = NA)
##     mt <- attr(mf, "terms")
##     y <- model.response(mf, "numeric")
##     w <- as.vector(model.weights(mf))
##     if (!is.null(w) && !is.numeric(w)) 
##         stop("'weights' must be a numeric vector")
##     offset <- model.offset(mf)
##     mlm <- is.matrix(y)
##     ny <- if (mlm) 
##         nrow(y)
##     else length(y)
##     if (!is.null(offset)) {
##         if (!mlm) 
##             offset <- as.vector(offset)
##         if (NROW(offset) != ny) 
##             stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
##                 NROW(offset), ny), domain = NA)
##     }
##     if (is.empty.model(mt)) {
##         x <- NULL
##         z <- list(coefficients = if (mlm) matrix(NA_real_, 0, 
##             ncol(y)) else numeric(), residuals = y, fitted.values = 0 * 
##             y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != 
##             0) else ny)
##         if (!is.null(offset)) {
##             z$fitted.values <- offset
##             z$residuals <- y - offset
##         }
##     }
##     else {
##         x <- model.matrix(mt, mf, contrasts)
##         z <- if (is.null(w)) 
##             lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
##                 ...)
##         else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, 
##             ...)
##     }
##     class(z) <- c(if (mlm) "mlm", "lm")
##     z$na.action <- attr(mf, "na.action")
##     z$offset <- offset
##     z$contrasts <- attr(x, "contrasts")
##     z$xlevels <- .getXlevels(mt, mf)
##     z$call <- cl
##     z$terms <- mt
##     if (model) 
##         z$model <- mf
##     if (ret.x) 
##         z$x <- x
##     if (ret.y) 
##         z$y <- y
##     if (!qr) 
##         z$qr <- NULL
##     z
## }
## <bytecode: 0x1301bcc28>
## <environment: namespace:stats>
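
Printing a function's full source is not the only way to inspect it. Base R includes a few helpers that summarize a function instead; the sketch below is an aside, not part of the original exercise, and lm_argument_names is a made-up name.

```r
# Primitive functions such as sqrt() are implemented in low-level code
is.primitive(sqrt)

# lm() is written in R, so it is not primitive
is.primitive(lm)

# args() shows just the arguments a function accepts
args(lm)

# formals() returns the arguments as a list, so their names can be extracted
lm_argument_names <- names(formals(lm))
lm_argument_names
```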

1.2.2.5 help pages

Wow! lm() runs a lot of code. What does it do? Open the help page for lm() in the chunk below and find out.

?lm
lm {stats}  R Documentation
Fitting Linear Models
Description
lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).

Usage
lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)
Arguments
formula 
an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

data    
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.

subset  
an optional vector specifying a subset of observations to be used in the fitting process.

weights 
an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used. See also ‘Details’,

na.action   
a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.

method  
the method to be used; for fitting, currently only method = "qr" is supported; method = "model.frame" returns the model frame (the same as with model = TRUE, see below).

model, x, y, qr 
logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.

singular.ok 
logical. If FALSE (the default in S but not in R) a singular fit is an error.

contrasts   
an optional list. See the contrasts.arg of model.matrix.default.

offset  
this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector or matrix of extents matching those of the response. One or more offset terms can be included in the formula instead or as well, and if more than one are specified their sum is used. See model.offset.

... 
additional arguments to be passed to the low level regression fitting functions (see below).

Details
Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
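As a small illustration of the formula operators described above, the sketch below fits the same model two ways and checks that the coefficients agree. The built-in mtcars data set and the variables mpg, wt, and hp are assumptions chosen only for this example.

```r
# Two ways to write the same model: main effects plus their interaction.
fit_star <- lm(mpg ~ wt * hp, data = mtcars)
fit_long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Both formulas expand to the same terms, so the coefficients match.
all.equal(coef(fit_star), coef(fit_long))
```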

If the formula includes an offset, this is evaluated and subtracted from the response.

If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.

See model.matrix for some further details. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula (see aov and demo(glm.vr) for an example).

A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae.

Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized). However, in the latter case, notice that within-group variation is not used. Therefore, the sigma estimate and residual degrees of freedom may be suboptimal; in the case of replication weights, even wrong. Hence, standard errors and analysis of variance tables should be treated with care.

lm calls the lower level functions lm.fit, etc, see below, for the actual numerical computations. For programming only, you may consider doing likewise.

All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.

Value
lm returns an object of class "lm" or for multiple responses of class c("mlm", "lm").

The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.

An object of class "lm" is a list containing at least the following components:

coefficients    
a named vector of coefficients

residuals   
the residuals, that is response minus fitted values.

fitted.values   
the fitted mean values.

rank    
the numeric rank of the fitted linear model.

weights 
(only for weighted fits) the specified weights.

df.residual 
the residual degrees of freedom.

call    
the matched call.

terms   
the terms object used.

contrasts   
(only where relevant) the contrasts used.

xlevels 
(only where relevant) a record of the levels of the factors used in fitting.

offset  
the offset used (missing if none were used).

y   
if requested, the response used.

x   
if requested, the model matrix used.

model   
if requested (the default), the model frame used.

na.action   
(where relevant) information returned by model.frame on the special handling of NAs.

In addition, non-null fits will have components assign, effects and (unless not requested) qr relating to the linear fit, for use by extractor functions such as summary and effects.

Using time series
Considerable care is needed when using lm with time series.

Unless na.action = NULL, the time series attributes are stripped from the variables before the regression is done. (This is necessary as omitting NAs would invalidate the time series attributes, and if NAs are omitted in the middle of the series the result would no longer be a regular time series.)

Even if the time series attributes are retained, they are not used to line up series, so that the time shift of a lagged or differenced regressor would be ignored. It is good practice to prepare a data argument by ts.intersect(..., dframe = TRUE), then apply a suitable na.action to that data frame and call lm with na.action = NULL so that residuals and fitted values are time series.

Note
Offsets specified by offset will not be included in predictions by predict.lm, whereas those specified by an offset term in the formula will be.

Author(s)
The design was inspired by the S function of the same name described in Chambers (1992). The implementation of model formula by Ross Ihaka was based on Wilkinson & Rogers (1973).

References
Chambers, J. M. (1992) Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.

Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic descriptions of factorial models for analysis of variance. Applied Statistics, 22, 392–399. doi: 10.2307/2346786.

See Also
summary.lm for summaries and anova.lm for the ANOVA table; aov for a different interface.

The generic functions coef, effects, residuals, fitted, vcov.

predict.lm (via predict) for prediction, including confidence and prediction intervals; confint for confidence intervals of parameters.

lm.influence for regression diagnostics, and glm for generalized linear models.

The underlying low level functions, lm.fit for plain, and lm.wfit for weighted regression fitting.

More lm() examples are available e.g., in anscombe, attitude, freeny, LifeCycleSavings, longley, stackloss, swiss.

biglm in package biglm for an alternative way to fit linear models to large datasets (especially those with many cases).

Examples
require(graphics)

## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
lm.D90 <- lm(weight ~ group - 1) # omitting intercept

anova(lm.D9)
summary(lm.D90)

opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(lm.D9, las = 1)      # Residuals, Fitted, ...
par(opar)

### less simple examples in "See Also" above
"Good job! `lm()` is R's function for fitting basic linear models. No wonder it runs so much code."

1.2.2.6 Code comments

What do you think the chunk below will return? Run it and see. The result should be nothing. R will not run anything on a line after a # symbol. This is useful because it lets you write human readable comments in your code: just place the comments after a #. Now delete the # and re-run the chunk. You should see a result.

# sqrt(961)
sqrt(961)
## [1] 31

1.2.3 Arguments

1.2.3.1 Arguments

Video: https://vimeo.com/220490157

1.2.3.2 args()

rnorm() is a function that generates random variables from a normal distribution. Find the arguments of rnorm().

args(rnorm)
## function (n, mean = 0, sd = 1) 
## NULL
"Good job! `n` specifies the number of random normal variables to generate. `mean` and `sd` describe the distribution to generate the random values with."

1.2.3.3 Optional arguments

Which arguments of rnorm() are optional?

  • n ✗
  • mean ✓
  • sd ✓

n is not an optional argument because it does not have a default value.
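One way to see this for yourself is sketched below. formals() is not covered in the tutorial; it is used here only to list each argument together with its default value.

```r
# Arguments with defaults can be omitted; n cannot.
x <- rnorm(5)            # uses mean = 0 and sd = 1
y <- rnorm(5, sd = 10)   # overrides only sd

# formals() returns each argument with its default (empty for n).
defaults <- formals(rnorm)
names(defaults)
```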

1.2.3.4 rnorm() 1

Use rnorm() to generate 100 random normal values with a mean of 100 and a standard deviation of 15.

rnorm(100, mean = 100, sd = 15)
##   [1] 106.77717  96.64706  94.27992  96.79427  83.51553 110.83675  84.00723
##   [8] 109.56798  96.76815  88.90998  90.21935 107.48061 127.74486 122.25443
##  [15] 101.46010 103.76057  92.56121 124.89257  91.54704 111.86213 100.27495
##  [22] 126.79548  77.56887  81.47184  96.89119  95.15990 120.12624 115.44153
##  [29]  95.59624  72.88438  92.43411 125.50980  97.08877 105.41194  90.02012
##  [36] 115.76164  96.75034  97.22880 132.41666  96.00209 102.58404  83.19374
##  [43] 103.83508 102.02852 102.47443  88.20627 115.23612  98.91257  97.92002
##  [50] 107.63582  95.99832  89.69415 105.59018  97.78842  97.71702  95.59797
##  [57]  75.44058 105.78448  96.04326 107.51356  97.96445  95.35418 131.77878
##  [64] 100.80586 106.23207 109.35130 104.02730  75.68339 109.57379  92.17855
##  [71]  95.37452  85.55911 134.30015 132.33867  96.90743 112.53948  99.97960
##  [78]  86.59587  84.53053  97.75938  68.35060  89.29400 111.79914 120.81460
##  [85] 130.65068 114.66434 105.76461  93.31965 113.09996  81.32628 106.24608
##  [92]  92.41039  84.21390 125.70457  94.96537  95.86193 125.58928  69.98398
##  [99] 101.56256 107.50966

1.2.3.5 rnorm() 2

Can you spot the error in the code below? Fix the code and then re-run it.

rnorm(100, mu = 100, sd = 50)
rnorm(100, mean = 100, sd = 50)
##   [1]  23.047135 194.939609  34.226652 168.869725  94.919515 104.062345
##   [7] 126.053860 152.530388 157.177136  72.605948  83.767693 139.228550
##  [13] 104.157393  67.346849 101.156752 133.313704  12.633888 151.727114
##  [19]  35.080397  88.844966 123.158523  32.803794  70.592004  30.095346
##  [25] 116.775865 115.686451 114.683681  96.252669 108.933987  83.146571
##  [31] 157.915458  47.543798  47.479208  42.485658 139.542386  17.188037
##  [37] 120.415898 126.567071  81.390605  68.205015  32.445949  50.481090
##  [43] 181.586240  66.049696 160.215740  37.961944  62.128938 144.913464
##  [49] 124.586252 108.620969 122.775460  88.737818  70.901842 166.739024
##  [55]  86.633994 129.586291  78.467321  77.709106 130.581754  28.166392
##  [61]  53.946862  78.295532 218.691506 159.565607  58.587636 133.599970
##  [67]  60.483692  27.036205  45.483835 183.997753 225.351244 156.565826
##  [73]  87.168412 151.186104  69.423703  84.847331 110.217830 103.992208
##  [79] 165.912385  42.901253 132.188116  73.574976  75.011586  50.489967
##  [85]  84.461263 129.312759 147.992591 163.795431  20.237342  87.143155
##  [91] -45.531370  90.327897  44.753952  46.746033  44.468302 132.605460
##  [97]  -7.850684 197.588639 130.041100 125.700315
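The first line above fails because rnorm() has no argument named mu. If you would like to inspect the error without stopping your script, one possible approach (an addition to the tutorial, using base R's tryCatch()) is to capture the message as a string:

```r
# Capture the error message instead of halting.
msg <- tryCatch(
  rnorm(100, mu = 100, sd = 50),
  error = function(e) conditionMessage(e)
)

# msg now holds R's complaint about the unused argument.
msg
```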

1.2.4 Objects

1.2.4.1 Objects

Video: https://vimeo.com/220493412

1.2.4.2 Object names

You can choose almost any name you like for an object, as long as the name does not begin with a number or a special character like +, -, *, /, ^, !, @, or &.

Which of these would be valid object names?

  • today ✓
  • 1st ✗
  • +1 ✗
  • vars ✓
  • ^_^ ✗
  • foo ✓
Remember that the most helpful names will remind you what you put in your object.
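If you are unsure about a name, a rough check is sketched below. It relies on make.names(), a base R function not introduced in this tutorial, which rewrites any invalid name into a valid one; a name that comes back unchanged was already valid.

```r
candidates <- c("today", "1st", "+1", "vars", "^_^", "foo")

# TRUE where make.names() had nothing to repair.
is_valid <- make.names(candidates) == candidates
setNames(is_valid, candidates)
```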

1.2.4.3 Using objects

In the code chunk below, save the results of rnorm(100, mean = 100, sd = 15) to an object named data. Then, on a new line, call the hist() function on data to plot a histogram of the random values.

data <- rnorm(100, mean = 100, sd = 15)
hist(data)

1.2.4.4 What if?

What do you think would happen if you assigned data to a new object named copy, like this? Run the code and then inspect both data and copy.

data <- rnorm(100, mean = 100, sd = 15)
copy <- data
identical(copy, data)
## [1] TRUE
"Good job! R saves a copy of the contents in data to copy."

1.2.4.5 Data sets

Objects provide an easy way to store data sets in R. In fact, R comes with many toy data sets pre-loaded. Examine the contents of iris to see a classic toy data set. Hint: how could you learn more about the iris object?

iris
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows
"Good job! You can learn more about iris by examining its help page with `?iris`."

1.2.4.6 rm()

What if you accidentally overwrite an object? If that object came with R or one of its packages, you can restore the original version of the object by removing your version with rm(). Run rm() on iris below to restore the iris data set.
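The sketch below hints at why this works: your version only hides the original, which still lives in its package. The datasets:: prefix is an addition not used elsewhere in this tutorial.

```r
iris <- 1                       # your version hides the built-in data set

# The original is still reachable through the package that provides it.
is.data.frame(datasets::iris)

rm(iris)                        # removes only your version
is.data.frame(iris)
```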

iris <- 1
iris
## [1] 1
rm(iris)
iris
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows
"Good job! Unfortunately, `rm()` cannot help you if you overwrite one of your own objects."

1.2.5 Vectors

Video: https://vimeo.com/220490316

1.2.5.1 Create a vector

In the chunk below, create a vector that contains the integers from one to ten.

c(1,2,3,4,5,6,7,8,9,10)
##  [1]  1  2  3  4  5  6  7  8  9 10

If your vector contains a sequence of contiguous integers, you can create it with the : shortcut. Run 1:10 in the chunk below. What do you get? What do you suppose 1:20 would return?

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

You can extract any element of a vector by placing a pair of brackets behind the vector. Inside the brackets place the number of the element that you’d like to extract. For example, vec[3] would return the third element of the vector named vec.

Use the chunk below to extract the fourth element of vec.

vec <- c(1, 2, 4, 8, 16)
vec[4]
## [1] 8

1.2.5.2 More []

You can also use [] to extract multiple elements of a vector. Place the vector c(1,2,5) between the brackets below. What does R return?

vec <- c(1, 2, 4, 8, 16)
vec[]
vec <- c(1, 2, 4, 8, 16)
vec[c(1,2,5)]
## [1]  1  2 16
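Brackets accept more than positive positions. The short sketch below shows two other forms that go beyond this exercise; the object names drop_first and big_values are made up for illustration.

```r
vec <- c(1, 2, 4, 8, 16)

drop_first <- vec[-1]        # negative index: everything except element 1
big_values <- vec[vec > 3]   # logical index: elements where the test is TRUE

drop_first
big_values
```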

1.2.5.3 Names

If the elements of your vector have names, you can extract them by name. To do so place a name or vector of names in the brackets behind a vector. Surround each name with quotation marks, e.g. vec2[c("alpha", "beta")].

Extract the element named gamma from the vector below.

vec2 <- c(alpha = 1, beta = 2, gamma = 3)
vec2["gamma"]
## gamma 
##     3

1.2.5.4 Vectorised operations

Predict what the code below will return. Then look at the result.

c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) + c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
##  [1]  2  4  6  8 10 12 14 16 18 20
"Good job! Like many R functions, R's math operators are vectorised: they're designed to work with vectors by repeating the operation for each pair of elements."

1.2.5.5 Vector recycling

Predict what the code below will return. Then look at the result.

1 + c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
##  [1]  2  3  4  5  6  7  8  9 10 11
"Good job! Whenever you try to work with vectors of varying lengths (recall that `1` is a vector of length one), R will repeat the shorter vector as needed to compute the result."

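Recycling also applies when the shorter vector has more than one element. In this sketch (the names short and long are illustrative), a vector of length two is repeated three times to match a vector of length six. If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.

```r
short <- c(10, 20)
long <- 1:6

# short is reused as 10, 20, 10, 20, 10, 20
recycled <- short + long
recycled
```
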
1.2.6 Types

1.2.6.1 Types

Video: https://vimeo.com/220490241

1.2.6.2 Atomic types

Which of these is not an atomic data type?

  • complex ✗
  • raw ✗
  • logical ✗
  • integer ✗
  • simple ✓
  • character ✗
  • numeric/double ✗

1.2.6.3 What type?

What type of data is “1L”?

  • numeric/double ✗
  • integer ✗
  • character ✓
  • logical ✗
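You can check answers like this one with typeof(), which reports the type of any value. A minimal sketch:

```r
types <- c(
  typeof(1),      # a bare number is a double
  typeof(1L),     # the L suffix makes an integer
  typeof("1L"),   # quotation marks make a character string
  typeof(TRUE)    # TRUE and FALSE are logicals
)
types
```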

1.2.6.4 Integers

Create a vector of integers from one to five. Can you imagine why you might want to use integers instead of numbers/doubles?

c(1L, 2L, 3L, 4L, 5L)
## [1] 1 2 3 4 5

1.2.6.5 Floating point arithmetic

Computers must use a finite amount of memory to store decimal numbers (which can sometimes require infinite precision). As a result, some decimals can only be saved as very precise approximations. From time to time you’ll notice side effects of this imprecision, like below.

Compute the square root of two, square the answer (i.e. multiply the square root of two by the square root of two), and then subtract two from the result. What answer do you expect? What answer do you get?

sqrt(2)^2 - 2
## [1] 4.440892e-16
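Because of this rounding error, an exact comparison with == can be misleading. One common workaround, sketched here with base R's all.equal(), is to compare within a small tolerance:

```r
x <- sqrt(2)^2

exact <- x == 2                    # exact comparison trips over the rounding error
close <- isTRUE(all.equal(x, 2))   # comparison within a small tolerance

c(exact = exact, close = close)
```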

1.2.6.6 Vectors

How many types of data can you put into a single vector?

  • 1 ✓
  • 6 ✗
  • As many as you like ✗
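If you combine values of different types with c(), R does not raise an error; it quietly converts every element to a single common type. A short sketch (the name mixed is illustrative):

```r
mixed <- c(1001, TRUE, "stories")

typeof(mixed)   # one type for the whole vector
mixed           # every element is now a character string
```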

1.2.6.7 Character or object?

One of the most common mistakes in R is to call an object when you mean to call a character string and vice versa.

Which of these are object names? What is the difference between object names and character strings?

  • foo ✓
  • “num” ✗
  • mu ✓
  • “sigma” ✗
  • “data” ✗
  • a ✓
Character strings are surrounded by quotation marks, object names are not.

1.2.7 Lists

1.2.7.1 Lists

Video: https://vimeo.com/220490360

1.2.7.2 Lists vs. vectors

Which data structure(s) could you use to store these pieces of data in the same object? 1001, TRUE, “stories”.

  • a vector ✗
  • a list ✓
  • neither ✗

1.2.7.3 Make a list

Make a list that contains the elements 1001, TRUE, and “stories”. Give each element a name.

list(num = 1001, logic = TRUE, char = "stories")
## $num
## [1] 1001
## 
## $logic
## [1] TRUE
## 
## $char
## [1] "stories"

1.2.7.4 Extract an element

Extract the number 1001 from the list below.

things <- list(number = 1001, logical = TRUE, string = "stories")
things$number
## [1] 1001
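The $ operator is one of several ways to reach into a list. The sketch below compares it with double and single brackets, which go slightly beyond this exercise; the names a, b, and c_sub are made up for illustration.

```r
things <- list(number = 1001, logical = TRUE, string = "stories")

a <- things$number          # the element itself
b <- things[["number"]]     # the same element, selected with a quoted name
c_sub <- things["number"]   # a list of length one that still wraps the element

identical(a, b)
is.list(c_sub)
```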

1.2.7.5 Data Frames

You can make a data frame with the data.frame() function, which works similarly to c() and list(). Assemble the vectors below into a data frame with the column names numbers, logicals, strings.

nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
data.frame(numbers = nums, logicals = logs, strings = strs)
## # A tibble: 4 × 3
##   numbers logicals strings
##     <dbl> <lgl>    <chr>  
## 1       1 TRUE     apple  
## 2       2 TRUE     banana 
## 3       3 FALSE    carrot 
## 4       4 TRUE     duck
"Good Job. When you make a data frame, you must follow one rule: each column vector should be the same length."

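To see the rule in action, the sketch below tries to combine a vector of length four with a vector of length three and catches the resulting error. Note that data.frame() will recycle a shorter vector when its length divides the longer one evenly, so not every mismatch fails.

```r
result <- tryCatch(
  data.frame(numbers = c(1, 2, 3, 4), logicals = c(TRUE, FALSE, TRUE)),
  error = function(e) "error"   # returned when the lengths cannot be reconciled
)
result
```
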
1.2.7.6 Extract a column

Given that a data frame is a type of list (with named elements), how could you extract the strings column of the df data frame below? Do it.

nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
df <- data.frame(numbers = nums, logicals = logs, strings = strs)
df$strings
## [1] "apple"  "banana" "carrot" "duck"

1.2.8 Packages

1.2.8.1 Packages

Video: https://vimeo.com/220490447

1.2.8.2 A common error

What does this common error message suggest? object '_____' not found.

  • You misspelled your object name ✗
  • You forgot to load the package that ____ comes in ✗
  • Either ✓

1.2.8.3 Load a package

In the code chunk below, load the tidyverse package. Whenever you load a package R will also load all of the packages that the first package depends on. tidyverse takes advantage of this to create a shortcut for loading several common packages at once. Whenever you load tidyverse, tidyverse also loads ggplot2, dplyr, tibble, tidyr, readr, and purrr.

library(tidyverse)
"Good job! R will keep the packages loaded until you close your R session. When you re-open R, you'll need to reload your packages."

1.2.8.4 Quotes

Did you know that library() is a special function in R? You can pass library() a package name in quotes, like library("tidyverse"), or not in quotes, like library(tidyverse); both will work! That’s often not the case with R functions.

In general, you should always use quotes unless you are writing the name of something that is already loaded into R’s memory, like a function, vector, or data frame.
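One consequence of this special behavior: if the package name is stored in an object, library() needs to be told to use the object's value rather than its name. A minimal sketch, using the stats package (which ships with R) purely for illustration:

```r
pkg <- "stats"   # a package name stored as a character string

# character.only = TRUE makes library() use the value stored in pkg.
library(pkg, character.only = TRUE)

"package:stats" %in% search()
```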

1.2.8.5 Install packages

But what if the package that you want to load is not installed on your computer? How would you install the dplyr package on your own computer?

install.packages("dplyr")
unable to install packages
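On your own computer you may want to install a package only when it is missing. The sketch below is one possible approach; needs_install() is a hypothetical helper written for this example, not a function from any package, and the install line is left as a comment so that nothing is downloaded here.

```r
# TRUE when the package still has to be installed.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

needs_install("stats")   # stats ships with R, so nothing to install

# On your own computer you could then write:
# if (needs_install("dplyr")) install.packages("dplyr")
```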

1.2.8.6 Congratulations!

Congratulations. You now have a formal sense for how the basics of R work. Although you may think of yourself as a Data Scientist, this brief Computer Science background will help you as you analyze data. Whenever R does something unexpected, you can apply your knowledge of how R works to figure out what went wrong.

2 Work with Data

Learn the most important data handling skills in R: how to extract values from a table, subset tables, calculate summary statistics, and derive new variables.

If you’re ready to begin, go to the first tutorial. There is no need to install or download anything. Each tutorial has everything you need to write and run R code, right in the tutorial.

2.1 Working with Tibbles

Learn to use tibbles, the most user-friendly tabular data structure in R, as well as how to manage tidyverse packages with… the tidyverse package.

2.1.1 Welcome

In this primer, you will explore the popularity of different names over time. To succeed, you will need to master some common tools for manipulating data with R:

  • tibbles and View(), which let you inspect raw data
  • select() and filter(), which let you extract columns and rows from a data frame
  • arrange(), which lets you reorder the rows in your data
  • %>%, which organizes your code into reader-friendly “pipes”
  • mutate(), group_by(), and summarize(), which help you use your data to compute new variables and summary statistics

These are some of the most useful R functions for data science, and the tutorials that follow will provide you everything you need to learn them.

In the tutorials, we’ll use a dataset named babynames, which comes in a package that is also named babynames. Within babynames, you will find information about almost every name given to children in the United States since 1880.

This tutorial introduces babynames as well as a new data structure that makes working with data in R easy: the tibble.

In addition to babynames, this tutorial uses the core tidyverse packages, including ggplot2, tibble, and dplyr. All of these packages have been pre-installed for your convenience. But they haven’t been pre-loaded—something you will soon learn more about!

Click the Next Topic button to begin.

2.1.2 babynames

Package

2.1.2.1 Loading babynames

Before we begin, let’s learn a little about our data. The babynames dataset comes in the babynames package. The package is pre-installed for you, just as ggplot2 was pre-installed in the last tutorial. But unlike in the last tutorial, I have not pre-loaded babynames, or any other package.

What does this mean? In R, whenever you want to use a package that is not part of base R, you need to load the package with the command library(). Until you load a package, R will not be able to find the datasets and functions contained in the package. For example, if we asked R to display the babynames dataset, which comes in the babynames package, right now, we’d get the message below. R cannot find the dataset because we haven’t loaded the babynames package.

## Error in eval(expr, envir, enclos): object 'babynames' not found

To load the babynames package, you would run the command library(babynames). After you load a package, R will be able to find its contents until you close R. The next time you open R, you will need to reload the package if you wish to use it again.
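As an aside that this tutorial does not cover, R also lets you reach a single object inside an installed package without loading the whole package, by writing package::name (for example, babynames::babynames). The sketch below uses the datasets and stats packages as stand-ins because they ship with every R installation.

```r
# A data set reached through its package name.
avg <- mean(datasets::mtcars$mpg)

# A function called through its package name.
spread <- stats::sd(datasets::mtcars$mpg)

c(mean = avg, sd = spread)
```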

This might sound like an inconvenience, but choosing which packages to load keeps your R experience simple and orderly.

In the chunk below, load babynames (the package) and then open the help page for babynames (the data set). Be sure to read the help page before going on.

library(babynames)

2.1.2.2 The data

Now that you know a little about the dataset, let’s examine its contents. If you were to run babynames at your R console, you would get output that looks like this:

babynames
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows

Yikes. What is happening?

2.1.2.3 Displaying large data

babynames is a large data frame, and R is not well equipped to display the contents of large data frames. R shows as many rows as possible before your memory buffer is overwhelmed. At that point, R stops, leaving you to look at an arbitrary section of your data.

You can avoid this behaviour by transforming your data frame to a tibble.

2.1.3 tibbles

2.1.3.1 What is a tibble?

A tibble is a special type of table. R displays tibbles in a refined way whenever you have the tibble package loaded: R will print only the first ten rows of a tibble as well as all of the columns that fit into your console window. R also adds useful summary information about the tibble, such as the data types of each column and the size of the data set.

Whenever you do not have the tibble package loaded, R will display the tibble as if it were a data frame. In fact, tibbles are data frames: an enhanced type of data frame.

You can think of the difference between the data frame display and the tibble display like this:

[Image: the data frame display compared with the tibble display]

as_tibble()

You can transform a data frame to a tibble with the as_tibble() function in the tibble package, e.g. as_tibble(cars). However, babynames is already a tibble. To display it nicely, you just need to load the tibble package.

To see what I mean, use library() to load the tibble package in the chunk below and then call babynames.

library(tibble)
babynames
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows
"Excellent! If you want to check whether or not an object is a tibble, you can use the `is_tibble()` function that comes in the tibble package. For example, this would return TRUE: `is_tibble(babynames)`."

You do not need to worry much about tibbles in these tutorials; in future tutorials, I’ll automatically convert each data frame into an interactive table. However, you should consider making tibbles an important part of your work in R.
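If you would like to try the conversion on a small data set, here is a minimal sketch using the built-in cars data frame. It assumes the tibble package is installed and skips the demonstration otherwise; the name cars_tbl is made up for this example.

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  cars_tbl <- tibble::as_tibble(cars)

  print(tibble::is_tibble(cars_tbl))   # is the result a tibble?
  print(is.data.frame(cars_tbl))       # is it also still a data frame?
  print(tibble::is_tibble(cars))       # was the original a tibble?
}
```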

2.1.3.2 View()

What if you’d like to inspect the remaining portions of a tibble? To see the entire tibble, use the View() command. R will launch a window that shows a scrollable display of the entire data set. For example, the code below will launch a data viewer in the RStudio IDE.

View(babynames)

View() works in conjunction with the software that you run R from: View() opens the data editor provided by that software. Unfortunately, this tutorial doesn’t come with a data editor, so you won’t be able to use View() today (unless you open the RStudio IDE, for example, and run the code there).

2.1.4 tidyverse

2.1.4.1 The tidyverse

The tibble package is one of several packages that are known collectively as “the tidyverse”. Tidyverse packages share a common philosophy and are designed to work well together. For example, in this tutorial you will use the tibble package, the ggplot2 package, and the dplyr package, all of which belong to the tidyverse.

2.1.4.2 The tidyverse package

When you use tidyverse packages, you can make your life easier by using the tidyverse package. The tidyverse package provides a shortcut for installing and loading the entire suite of packages in “the tidyverse”.

2.1.4.3 Installing the tidyverse

Think of the tidyverse package as a placeholder for the packages that are in the “tidyverse”. By itself, tidyverse does not do much, but when you install the tidyverse package it instructs R to install every other package in the tidyverse at the same time. In other words, when you run install.packages("tidyverse"), R installs the following packages for you in one simple step:

  • ggplot2
  • dplyr
  • tidyr
  • readr
  • purrr
  • tibble
  • hms
  • stringr
  • lubridate
  • forcats
  • DBI
  • haven
  • jsonlite
  • readxl
  • rvest
  • xml2
  • modelr
  • broom

2.1.4.4 Loading the tidyverse

When you load tidyverse with library("tidyverse"), it instructs R to load the most commonly used tidyverse packages. These are:

  • ggplot2
  • dplyr
  • tidyr
  • readr
  • purrr
  • tibble

You can load the less commonly used tidyverse packages in the normal way, by running library() for each of them.

Let’s give this a try. We will use the ggplot2 and dplyr packages later in this tutorial. Let’s use the tidyverse package to load them in the chunk below:

library(tidyverse)

2.1.4.5 Quiz

Which package is not loaded by library("tidyverse")?

  • ggplot2 ✗
  • dplyr ✗
  • tibble ✗
  • babynames ✓
Correct! 

Now that you are familiar with the data set, and have loaded the necessary packages, let's explore the data.

2.1.4.6 Recap

Tibbles and the tidyverse package are two tools that make life with R easier. Ironically, you may not come to appreciate their value right away: these tutorials pre-load packages for you, and they wrap data frames into an interactive table for display (at least the tutorials in the primers that follow will). However, you will want to utilize tibbles and the tidyverse package when you move out of the tutorials and begin doing your own work with R inside of the RStudio IDE.

This tutorial also introduced the babynames dataset. In the next tutorial, you will use this data set to plot the popularity of your name over time. Along the way, you will learn how to filter and subset data sets in R.

2.2 Isolating Data with dplyr

Master three simple functions for finding, and extracting, the data in your data set. Here you will learn to select variables, filter observations, and arrange values. Here, you will also meet R’s pipe operator, %>%.

2.2.1 Welcome

In this case study, you will explore the popularity of your own name over time. Along the way, you will master some of the most useful functions for isolating variables, cases, and values within a data frame:

  • select() and filter(), which let you extract columns and rows from a data frame
  • arrange(), which lets you reorder the rows in your data
  • %>%, which organizes your code into reader-friendly “pipes”

This tutorial uses the core tidyverse packages, including ggplot2, tibble, and dplyr, as well as the babynames package. All of these packages have been pre-installed and pre-loaded for your convenience.

Click the Next Topic button to begin.

2.2.2 Your name

The history of your name

You can use the data in babynames to make graphs like this, which reveal the history of a name, perhaps your name.

[image]

But before you do, you will need to trim down babynames. At the moment, there are more rows in babynames than you need to build your plot.

2.2.2.1 An example

To see what I mean, consider how I made the plot above: I began with the entire data set, which if plotted as a scatterplot would’ve looked like this.

[image]

I then narrowed the data to just the rows that contain my name, before plotting the data with a line geom. Here’s how the rows with just my name look as a scatterplot.

[image]

If I had skipped this step, my line graph would’ve connected all of the points in the large data set, creating an uninformative graph.

[image]

Your goal in this section is to repeat this process for your own name (or a name that you choose). Along the way, you will learn a set of functions that isolate information within a data set.

2.2.2.2 Isolating data

This type of task occurs often in data science: you need to extract data from a table before you can use it. You can do this task quickly with three functions that come in the dplyr package:

  • select() - which extracts columns from a data frame
  • filter() - which extracts rows from a data frame
  • arrange() - which moves important rows to the top of a data frame

Each function takes a data frame or tibble as its first argument and returns a new data frame or tibble as its output.
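As a minimal sketch of that shared pattern, the chunk below applies each function to a small made-up tibble. The `pets` table and its columns are hypothetical, invented for illustration; they are not part of babynames.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
pets <- tibble(
  name = c("Rex", "Tom", "Ada"),
  kind = c("dog", "cat", "dog"),
  age  = c(7, 3, 5)
)

cols  <- select(pets, name, age)      # keep two columns
dogs  <- filter(pets, kind == "dog")  # keep the rows that pass the test
young <- arrange(pets, age)           # reorder rows, lowest age first

# Each result is itself a data frame (here, a tibble),
# so it can be handed to the next function
is.data.frame(cols)
is.data.frame(dogs)
is.data.frame(young)
```

Because the input and the output have the same type, the three functions can be applied one after another in any order.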

2.2.3 select()

select() extracts columns of a data frame and returns the columns as a new data frame. To use select(), pass it the name of a data frame to extract columns from, and then the names of the columns to extract. The column names do not need to appear in quotation marks or be prefixed with a $; select() knows to find them in the data frame that you supply.

2.2.3.1 Exercise - select()

Use the example below to get a feel for select(). Can you extract just the name column? How about the name and year columns? How about all of the columns except prop?

select(babynames, name, sex)
## # A tibble: 1,924,665 × 2
##    name      sex  
##    <chr>     <chr>
##  1 Mary      F    
##  2 Anna      F    
##  3 Emma      F    
##  4 Elizabeth F    
##  5 Minnie    F    
##  6 Margaret  F    
##  7 Ida       F    
##  8 Alice     F    
##  9 Bertha    F    
## 10 Sarah     F    
## # … with 1,924,655 more rows
# Can you extract just the name column?
select(babynames, name)
## # A tibble: 1,924,665 × 1
##    name     
##    <chr>    
##  1 Mary     
##  2 Anna     
##  3 Emma     
##  4 Elizabeth
##  5 Minnie   
##  6 Margaret 
##  7 Ida      
##  8 Alice    
##  9 Bertha   
## 10 Sarah    
## # … with 1,924,655 more rows
# How about the name and year columns?
select(babynames, name, year)
## # A tibble: 1,924,665 × 2
##    name       year
##    <chr>     <dbl>
##  1 Mary       1880
##  2 Anna       1880
##  3 Emma       1880
##  4 Elizabeth  1880
##  5 Minnie     1880
##  6 Margaret   1880
##  7 Ida        1880
##  8 Alice      1880
##  9 Bertha     1880
## 10 Sarah      1880
## # … with 1,924,655 more rows
# How about all of the columns except prop?
select(babynames, -prop)
## # A tibble: 1,924,665 × 4
##     year sex   name          n
##    <dbl> <chr> <chr>     <int>
##  1  1880 F     Mary       7065
##  2  1880 F     Anna       2604
##  3  1880 F     Emma       2003
##  4  1880 F     Elizabeth  1939
##  5  1880 F     Minnie     1746
##  6  1880 F     Margaret   1578
##  7  1880 F     Ida        1472
##  8  1880 F     Alice      1414
##  9  1880 F     Bertha     1320
## 10  1880 F     Sarah      1288
## # … with 1,924,655 more rows

2.2.3.2 select() helpers

You can also use a series of helpers with select(). For example, if you place a minus sign before a column name, select() will return every column but that column. Can you predict how the minus sign will work here?

select(babynames, -c(n, prop))
## # A tibble: 1,924,665 × 3
##     year sex   name     
##    <dbl> <chr> <chr>    
##  1  1880 F     Mary     
##  2  1880 F     Anna     
##  3  1880 F     Emma     
##  4  1880 F     Elizabeth
##  5  1880 F     Minnie   
##  6  1880 F     Margaret 
##  7  1880 F     Ida      
##  8  1880 F     Alice    
##  9  1880 F     Bertha   
## 10  1880 F     Sarah    
## # … with 1,924,655 more rows

The table below summarizes the other select() helpers that are available in dplyr. Study it, and then click “Continue” to test your understanding.

Helper function   Use                                            Example
-                 Columns except                                 select(babynames, -prop)
:                 Columns between (inclusive)                    select(babynames, year:n)
contains()        Columns whose names contain a string           select(babynames, contains("n"))
ends_with()       Columns whose names end with a string          select(babynames, ends_with("n"))
matches()         Columns whose names match a regex              select(babynames, matches("n"))
num_range()       Columns with a numerical suffix in the range   Not applicable with babynames
one_of()          Columns whose names appear in the given set    select(babynames, one_of(c("sex", "gender")))
starts_with()     Columns whose names start with a string        select(babynames, starts_with("n"))
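Since babynames has no columns with numerical suffixes, here is a small sketch of the helpers on a made-up table. The `survey` tibble and its column names are hypothetical, chosen only so that num_range() has something to match.

```r
library(dplyr)

# Hypothetical survey data, invented so that num_range() has columns to match
survey <- tibble(
  id    = 1:2,
  wk1   = c(5, 3),
  wk2   = c(6, 4),
  wk3   = c(7, 2),
  notes = c("ok", "late")
)

names(select(survey, -notes))                # "id" "wk1" "wk2" "wk3"
names(select(survey, wk1:wk3))               # "wk1" "wk2" "wk3"
names(select(survey, starts_with("wk")))     # "wk1" "wk2" "wk3"
names(select(survey, ends_with("s")))        # "notes"
names(select(survey, contains("k")))         # "wk1" "wk2" "wk3"
names(select(survey, matches("^wk[12]$")))   # "wk1" "wk2"
names(select(survey, num_range("wk", 2:3)))  # "wk2" "wk3"
```

Wrapping each call in names() keeps the output short: it shows which columns survived without printing the data.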

2.2.3.3 select() quiz

Which of these is not a way to select the name and n columns together?

  • select(babynames, -c(year, sex, prop)) ✗
  • select(babynames, name:n) ✗
  • select(babynames, starts_with("n")) ✗
  • select(babynames, ends_with("n")) ✓

2.2.4 filter()

filter() extracts rows from a data frame and returns them as a new data frame. As with select(), the first argument of filter() should be a data frame to extract rows from. The arguments that follow should be logical tests; filter() will return every row for which the tests return TRUE.

2.2.4.1 filter in action

For example, the code chunk below returns every row with the name “Sea” in babynames.

filter(babynames, name == "Sea")
## # A tibble: 4 × 5
##    year sex   name      n       prop
##   <dbl> <chr> <chr> <int>      <dbl>
## 1  1982 F     Sea       5 0.00000276
## 2  1985 M     Sea       6 0.00000312
## 3  1986 M     Sea       5 0.0000026 
## 4  1998 F     Sea       5 0.00000258

2.2.4.2 Logical tests

To get the most from filter(), you will need to know how to use R’s logical test operators, which are summarised below.

Logical operator   Tests                              Example
>                  Is x greater than y?               x > y
>=                 Is x greater than or equal to y?   x >= y
<                  Is x less than y?                  x < y
<=                 Is x less than or equal to y?      x <= y
==                 Is x equal to y?                   x == y
!=                 Is x not equal to y?               x != y
is.na()            Is x an NA?                        is.na(x)
!is.na()           Is x not an NA?                    !is.na(x)
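To see what these tests return, it can help to run them on a plain vector first. The sketch below uses a made-up vector `x` with one missing value; each test returns one TRUE, FALSE, or NA per element.

```r
# A small made-up vector with one missing value
x <- c(3, 8, NA, 5)

x > 4      # FALSE  TRUE    NA  TRUE
x == 5     # FALSE FALSE    NA  TRUE
x != 5     #  TRUE  TRUE    NA FALSE
is.na(x)   # FALSE FALSE  TRUE FALSE
!is.na(x)  #  TRUE  TRUE FALSE  TRUE
```

Notice that comparing a value to NA gives NA, not FALSE. filter() keeps only the rows where a test is TRUE, so rows where a test returns NA are dropped.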

2.2.4.3 Exercise - Logical Operators

See if you can use the logical operators to modify the code below to show:

  • All of the names where prop is greater than or equal to 0.08
  • All of the children named “Khaleesi”
  • All of the names that have a missing value for n (Hint: this should return an empty data set).
filter(babynames, name == "Sea")
## # A tibble: 4 × 5
##    year sex   name      n       prop
##   <dbl> <chr> <chr> <int>      <dbl>
## 1  1982 F     Sea       5 0.00000276
## 2  1985 M     Sea       6 0.00000312
## 3  1986 M     Sea       5 0.0000026 
## 4  1998 F     Sea       5 0.00000258

2.2.4.4 Two common mistakes

When you use logical tests, be sure to look out for two common mistakes. One appears in each code chunk below. Can you find them? When you spot a mistake, fix it and then run the chunk to confirm that it works.

filter(babynames, name = "Sea")
filter(babynames, name == "Sea")
## # A tibble: 4 × 5
##    year sex   name      n       prop
##   <dbl> <chr> <chr> <int>      <dbl>
## 1  1982 F     Sea       5 0.00000276
## 2  1985 M     Sea       6 0.00000312
## 3  1986 M     Sea       5 0.0000026 
## 4  1998 F     Sea       5 0.00000258
"Good Job! Remember to use == instead of = when testing for equality."
filter(babynames, name == Sea)
filter(babynames, name == "Sea")
## # A tibble: 4 × 5
##    year sex   name      n       prop
##   <dbl> <chr> <chr> <int>      <dbl>
## 1  1982 F     Sea       5 0.00000276
## 2  1985 M     Sea       6 0.00000312
## 3  1986 M     Sea       5 0.0000026 
## 4  1998 F     Sea       5 0.00000258
"Good Job! As written this code would check that name is equal to the contents of the object named Sea, which does not exist."

2.2.4.5 Two mistakes - Recap

When you use logical tests, be sure to look out for these two common mistakes:

  1. using = instead of == to test for equality.
  2. forgetting to use quotation marks when comparing strings, e.g. name == Abby instead of name == "Abby"

2.2.4.6 Combining tests

If you provide more than one test to filter(), filter() will combine the tests with an and statement (&): it will only return the rows that satisfy all of the tests.
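A quick way to convince yourself of this is to compare the two spellings on a tiny table. The `kids` tibble below is hypothetical toy data, not babynames.

```r
library(dplyr)

# Hypothetical toy data
kids <- tibble(
  name = c("Sea", "Sea", "Ann"),
  sex  = c("F", "M", "F")
)

# Two tests separated by a comma ...
with_comma <- filter(kids, name == "Sea", sex == "F")

# ... and the same two tests joined with &
with_and <- filter(kids, name == "Sea" & sex == "F")

# Both keep only the single row that passes both tests
with_comma
with_and
```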

To combine multiple tests in a different way, use R’s Boolean operators. For example, the code below will return all of the children named Sea or Anemone.

filter(babynames, name == "Sea" | name == "Anemone")
## # A tibble: 5 × 5
##    year sex   name        n       prop
##   <dbl> <chr> <chr>   <int>      <dbl>
## 1  1982 F     Sea         5 0.00000276
## 2  1985 M     Sea         6 0.00000312
## 3  1986 M     Sea         5 0.0000026 
## 4  1998 F     Sea         5 0.00000258
## 5  2012 F     Anemone     6 0.0000031

2.2.4.7 Boolean operators

You can find a complete list of base R’s Boolean operators in the table below.

Boolean operator   Represents                             Example
&                  Are both A and B true?                 A & B
|                  Are one or both of A and B true?       A | B
!                  Is A not true?                         !A
xor()              Is one and only one of A and B true?   xor(A, B)
%in%               Is x in the set of a, b, and c?        x %in% c(a, b, c)
any()              Are any of A, B, or C true?            any(A, B, C)
all()              Are all of A, B, and C true?           all(A, B, C)
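Here is a small sketch of these operators on made-up logical vectors. `A`, `B`, and `x` are illustrative objects, not columns of babynames.

```r
# Made-up logical vectors covering every combination of TRUE and FALSE
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

A & B      #  TRUE FALSE FALSE FALSE
A | B      #  TRUE  TRUE  TRUE FALSE
!A         # FALSE FALSE  TRUE  TRUE
xor(A, B)  # FALSE  TRUE  TRUE FALSE

x <- c(1, 4, 9)
x %in% c(1, 2, 3)  #  TRUE FALSE FALSE

# any() and all() collapse a whole vector into a single TRUE or FALSE
any(A)  # TRUE
all(A)  # FALSE
```

The first five operators return one result per element, which is what filter() needs to decide row by row. any() and all() return a single value, so inside filter() they would apply the same verdict to every row.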

2.2.4.8 Exercise - Combining tests

Use Boolean operators to alter the code chunk below to return only the rows that contain:

  • Girls named Sea
  • Names that were used by exactly 5 or 6 children in 1880
  • Names that are one of Acura, Lexus, or Yugo
filter(babynames, name == "Sea" | name == "Anemone")
## # A tibble: 5 × 5
##    year sex   name        n       prop
##   <dbl> <chr> <chr>   <int>      <dbl>
## 1  1982 F     Sea         5 0.00000276
## 2  1985 M     Sea         6 0.00000312
## 3  1986 M     Sea         5 0.0000026 
## 4  1998 F     Sea         5 0.00000258
## 5  2012 F     Anemone     6 0.0000031
# Girls named Sea
filter(babynames, sex == "F", name == "Sea")
## # A tibble: 2 × 5
##    year sex   name      n       prop
##   <dbl> <chr> <chr> <int>      <dbl>
## 1  1982 F     Sea       5 0.00000276
## 2  1998 F     Sea       5 0.00000258
# Names that were used by exactly 5 or 6 children
# (this test covers every year; add year == 1880 as a further argument to keep only 1880)
filter(babynames, n %in% c(5,6))
## # A tibble: 460,006 × 5
##     year sex   name        n      prop
##    <dbl> <chr> <chr>   <int>     <dbl>
##  1  1880 F     Abby        6 0.0000615
##  2  1880 F     Aileen      6 0.0000615
##  3  1880 F     Alba        6 0.0000615
##  4  1880 F     Alda        6 0.0000615
##  5  1880 F     Alla        6 0.0000615
##  6  1880 F     Alverta     6 0.0000615
##  7  1880 F     Ara         6 0.0000615
##  8  1880 F     Ardelia     6 0.0000615
##  9  1880 F     Ardella     6 0.0000615
## 10  1880 F     Arrie       6 0.0000615
## # … with 459,996 more rows
# Names that are one of Acura, Lexus, or Yugo
filter(babynames, name %in% c("Acura", "Lexus", "Yugo"))
## # A tibble: 57 × 5
##     year sex   name      n       prop
##    <dbl> <chr> <chr> <int>      <dbl>
##  1  1990 F     Lexus    36 0.0000175 
##  2  1990 M     Lexus    12 0.00000558
##  3  1991 F     Lexus   102 0.0000502 
##  4  1991 M     Lexus    16 0.00000755
##  5  1992 F     Lexus   193 0.0000963 
##  6  1992 M     Lexus    25 0.0000119 
##  7  1993 F     Lexus   285 0.000145  
##  8  1993 M     Lexus    30 0.0000145 
##  9  1994 F     Lexus   381 0.000195  
## 10  1994 F     Acura     6 0.00000308
## # … with 47 more rows

2.2.4.9 Two more common mistakes

Logical tests also invite two common mistakes that you should look out for. Each is displayed in a code chunk below; one produces an error and the other is needlessly verbose. Diagnose the chunks and then fix the code.

filter(babynames, 10 < n < 20)
filter(babynames, 10 < n, n < 20)
## # A tibble: 365,458 × 5
##     year sex   name           n     prop
##    <dbl> <chr> <chr>      <int>    <dbl>
##  1  1880 F     Antoinette    19 0.000195
##  2  1880 F     Clementine    19 0.000195
##  3  1880 F     Edythe        19 0.000195
##  4  1880 F     Harriette     19 0.000195
##  5  1880 F     Libbie        19 0.000195
##  6  1880 F     Lilian        19 0.000195
##  7  1880 F     Lue           19 0.000195
##  8  1880 F     Lutie         19 0.000195
##  9  1880 F     Magdalena     19 0.000195
## 10  1880 F     Meda          19 0.000195
## # … with 365,448 more rows
"Good job! You cannot combine two logical tests in R without using a Boolean operator (or at least a comma between filter arguments)."
filter(babynames, n == 5 | n == 6 | n == 7 | n == 8 | n == 9)
filter(babynames, n %in% 5:9)
## # A tibble: 811,195 × 5
##     year sex   name          n      prop
##    <dbl> <chr> <chr>     <int>     <dbl>
##  1  1880 F     Adela         9 0.0000922
##  2  1880 F     Althea        9 0.0000922
##  3  1880 F     Amalia        9 0.0000922
##  4  1880 F     Amber         9 0.0000922
##  5  1880 F     Angelina      9 0.0000922
##  6  1880 F     Annabelle     9 0.0000922
##  7  1880 F     Anner         9 0.0000922
##  8  1880 F     Arie          9 0.0000922
##  9  1880 F     Clarice       9 0.0000922
## 10  1880 F     Corda         9 0.0000922
## # … with 811,185 more rows
"Good job! Although the first code works, you should make your code more concise by collapsing multiple or statements into an %in% statement when possible."

2.2.4.10 Two more common mistakes - Recap

When you combine multiple logical tests, be sure to look out for these two common mistakes:

  1. Collapsing multiple logical tests into a single test without using a boolean operator
  2. Using repeated | instead of %in%, e.g. x == 1 | x == 2 | x == 3 instead of x %in% c(1, 2, 3)
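The sketch below shows both fixes side by side on a made-up table. The `counts` tibble is hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data
counts <- tibble(n = c(5, 8, 12, 15, 25))

# 1. A range test is two separate tests, joined by a comma (or by &)
in_range <- filter(counts, 10 < n, n < 20)
in_range$n  # 12 15

# 2. A chain of | tests and a single %in% test keep the same rows
verbose <- filter(counts, n == 5 | n == 8 | n == 12)
concise <- filter(counts, n %in% c(5, 8, 12))
verbose$n  # 5 8 12
concise$n  # 5 8 12
```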

2.2.5 arrange()

arrange() returns all of the rows of a data frame reordered by the values of a column. As with select(), the first argument of arrange() should be a data frame and the remaining arguments should be the names of columns. If you give arrange() a single column name, it will return the rows of the data frame reordered so that the row with the lowest value in that column appears first, the row with the second lowest value appears second, and so on. If the column contains character strings, arrange() will place them in alphabetical order.
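As a minimal sketch, the chunk below sorts a made-up table first by a numeric column and then by a character column. The `pets` tibble is hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data
pets <- tibble(
  name = c("Rex", "Ada", "Tom"),
  age  = c(7, 5, 3)
)

arrange(pets, age)$name   # "Tom" "Ada" "Rex"  (lowest age first)
arrange(pets, name)$name  # "Ada" "Rex" "Tom"  (alphabetical order)
```

In both cases arrange() returns every row; only the order changes.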

2.2.5.1 Exercise - arrange()

Use the code chunk below to arrange babynames by n. Can you tell what the smallest value of n is?

arrange(babynames, n)
## # A tibble: 1,924,665 × 5
##     year sex   name          n      prop
##    <dbl> <chr> <chr>     <int>     <dbl>
##  1  1880 F     Adelle        5 0.0000512
##  2  1880 F     Adina         5 0.0000512
##  3  1880 F     Adrienne      5 0.0000512
##  4  1880 F     Albertine     5 0.0000512
##  5  1880 F     Alys          5 0.0000512
##  6  1880 F     Ana           5 0.0000512
##  7  1880 F     Araminta      5 0.0000512
##  8  1880 F     Arthur        5 0.0000512
##  9  1880 F     Birtha        5 0.0000512
## 10  1880 F     Bulah         5 0.0000512
## # … with 1,924,655 more rows
"Good job! The compiler of `babynames` used 5 as a cutoff; a name only made it into babynames for a given year and gender if it was used for five or more children."

2.2.5.2 Tie breakers

If you supply additional column names, arrange() will use them as tie breakers to order rows that have identical values in the earlier columns. Add to the code below to make prop a tie breaker. The result should first order rows by value of n and then reorder rows within each value of n by values of prop.

arrange(babynames, n)
arrange(babynames, n, prop)
## # A tibble: 1,924,665 × 5
##     year sex   name            n       prop
##    <dbl> <chr> <chr>       <int>      <dbl>
##  1  2007 M     Aaban           5 0.00000226
##  2  2007 M     Aareon          5 0.00000226
##  3  2007 M     Aaris           5 0.00000226
##  4  2007 M     Abd             5 0.00000226
##  5  2007 M     Abdulazeez      5 0.00000226
##  6  2007 M     Abdulhadi       5 0.00000226
##  7  2007 M     Abdulhamid      5 0.00000226
##  8  2007 M     Abdulkadir      5 0.00000226
##  9  2007 M     Abdulraheem     5 0.00000226
## 10  2007 M     Abdulrahim      5 0.00000226
## # … with 1,924,655 more rows

2.2.5.3 desc

If you would rather arrange rows in the opposite order, i.e. from large values to small values, surround a column name with desc(). arrange() will reorder the rows based on the largest values to the smallest.

Add a desc() to the code below to display the most popular name for 2017 (the largest year in the dataset) instead of 1880 (the smallest year in the dataset).

arrange(babynames, year, desc(prop))
arrange(babynames, desc(year), desc(n))
## # A tibble: 1,924,665 × 5
##     year sex   name         n    prop
##    <dbl> <chr> <chr>    <int>   <dbl>
##  1  2017 F     Emma     19738 0.0105 
##  2  2017 M     Liam     18728 0.00954
##  3  2017 F     Olivia   18632 0.00994
##  4  2017 M     Noah     18326 0.00933
##  5  2017 F     Ava      15902 0.00848
##  6  2017 F     Isabella 15100 0.00805
##  7  2017 M     William  14904 0.00759
##  8  2017 F     Sophia   14831 0.00791
##  9  2017 M     James    14232 0.00725
## 10  2017 M     Logan    13974 0.00712
## # … with 1,924,655 more rows

Think you have it? Click Continue to test yourself.

2.2.5.4 arrange() quiz

Which name was the most popular for a single gender in a single year? In the code chunk below, use arrange() to make the row with the largest value of prop appear at the top of the data set.

arrange(babynames, desc(prop))
## # A tibble: 1,924,665 × 5
##     year sex   name        n   prop
##    <dbl> <chr> <chr>   <int>  <dbl>
##  1  1880 M     John     9655 0.0815
##  2  1881 M     John     8769 0.0810
##  3  1880 M     William  9532 0.0805
##  4  1883 M     John     8894 0.0791
##  5  1881 M     William  8524 0.0787
##  6  1882 M     John     9557 0.0783
##  7  1884 M     John     9388 0.0765
##  8  1882 M     William  9298 0.0762
##  9  1886 M     John     9026 0.0758
## 10  1885 M     John     8756 0.0755
## # … with 1,924,655 more rows
arrange(babynames, desc(n))
## # A tibble: 1,924,665 × 5
##     year sex   name        n   prop
##    <dbl> <chr> <chr>   <int>  <dbl>
##  1  1947 F     Linda   99686 0.0548
##  2  1948 F     Linda   96209 0.0552
##  3  1947 M     James   94756 0.0510
##  4  1957 M     Michael 92695 0.0424
##  5  1947 M     Robert  91642 0.0493
##  6  1949 F     Linda   91016 0.0518
##  7  1956 M     Michael 90620 0.0423
##  8  1958 M     Michael 90520 0.0420
##  9  1948 M     James   88588 0.0497
## 10  1954 M     Michael 88514 0.0428
## # … with 1,924,655 more rows
"The number of children represented by each proportion grew over time as the population grew."

2.2.6 %>%

2.2.6.1 Steps

Notice how each dplyr function takes a data frame as input and returns a data frame as output. This makes the functions easy to use in a step-by-step fashion. For example, you could:

  1. Filter babynames to just boys born in 2017
  2. Select the name and n columns from the result
  3. Arrange the rows so that the most popular names appear near the top.
boys_2017 <- filter(babynames, year == 2017, sex == "M")
boys_2017 <- select(boys_2017, name, n)
boys_2017 <- arrange(boys_2017, desc(n))
boys_2017
## # A tibble: 14,160 × 2
##    name         n
##    <chr>    <int>
##  1 Liam     18728
##  2 Noah     18326
##  3 William  14904
##  4 James    14232
##  5 Logan    13974
##  6 Benjamin 13733
##  7 Mason    13502
##  8 Elijah   13268
##  9 Oliver   13141
## 10 Jacob    13106
## # … with 14,150 more rows

2.2.6.2 Redundancy

The result shows us the most popular boys’ names from 2017, which is the most recent year in the data set. But take a look at the code. Do you notice how we re-create boys_2017 at each step so we will have something to pass to the next step? This is an inefficient way to write R code.

You could avoid creating boys_2017 by nesting your functions inside of each other, but this creates code that is hard to read:

arrange(select(filter(babynames, year == 2017, sex == "M"), name, n), desc(n))
## # A tibble: 14,160 × 2
##    name         n
##    <chr>    <int>
##  1 Liam     18728
##  2 Noah     18326
##  3 William  14904
##  4 James    14232
##  5 Logan    13974
##  6 Benjamin 13733
##  7 Mason    13502
##  8 Elijah   13268
##  9 Oliver   13141
## 10 Jacob    13106
## # … with 14,150 more rows

The dplyr package provides a third way to write sequences of functions: the pipe.

2.2.6.3 %>%

The pipe operator %>% performs an extremely simple task: it passes the result on its left into the first argument of the function on its right. Or put another way, x %>% f(y) is the same as f(x, y). This piece of code punctuation makes it easy to write and read series of functions that are applied in a step by step way. For example, we can use the pipe to rewrite our code above:

babynames %>% 
  filter(year == 2017, sex == "M") %>% 
  select(name, n) %>% 
  arrange(desc(n))
## # A tibble: 14,160 × 2
##    name         n
##    <chr>    <int>
##  1 Liam     18728
##  2 Noah     18326
##  3 William  14904
##  4 James    14232
##  5 Logan    13974
##  6 Benjamin 13733
##  7 Mason    13502
##  8 Elijah   13268
##  9 Oliver   13141
## 10 Jacob    13106
## # … with 14,150 more rows

As you read the code, pronounce %>% as “then”. You’ll notice that dplyr makes it easy to read pipes. Each function name is a verb, so our code resembles the statement, “Take babynames, then filter it by year and sex, then select the name and n columns, then arrange the results by descending values of n.”

dplyr also makes it easy to write pipes. Each dplyr function returns a data frame that can be piped into another dplyr function, which will accept the data frame as its first argument. In fact, dplyr functions are written with pipes in mind: each function does one simple task. dplyr expects you to use pipes to combine these simple tasks to produce sophisticated results.
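To check the rule that x %>% f(y) is the same as f(x, y), you can write the same steps both ways and compare the results. The sketch below does this on a made-up table; the `pets` tibble is hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data
pets <- tibble(
  name = c("Rex", "Ada", "Tom"),
  age  = c(7, 5, 3)
)

# Piped: read top to bottom, "take pets, then filter, then arrange"
piped <- pets %>%
  filter(age > 4) %>%
  arrange(name)

# Nested: the same calls, read from the inside out
nested <- arrange(filter(pets, age > 4), name)

piped$name   # "Ada" "Rex"
nested$name  # "Ada" "Rex"
```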

2.2.6.4 Exercise - Pipes

I’ll use pipes for the remainder of the tutorial, and I will expect you to as well. Let’s practice a little by writing a new pipe in the chunk below. The pipe should:

  1. Filter babynames to just the girls that were born in 2017
  2. Select the name and n columns
  3. Arrange the results so that the most popular names are near the top.

Try to write your pipe without copying and pasting the code from above.

babynames %>% 
  filter(year == 2017, sex == "F") %>% 
  select(name, n) %>% 
  arrange(desc(n))
## # A tibble: 18,309 × 2
##    name          n
##    <chr>     <int>
##  1 Emma      19738
##  2 Olivia    18632
##  3 Ava       15902
##  4 Isabella  15100
##  5 Sophia    14831
##  6 Mia       13437
##  7 Charlotte 12893
##  8 Amelia    11800
##  9 Evelyn    10675
## 10 Abigail   10551
## # … with 18,299 more rows

2.2.6.5 Your name

You’ve now mastered a set of skills that will let you easily plot the popularity of your name over time. In the code chunk below, use a combination of dplyr and ggplot2 functions with %>% to:

  1. Trim babynames to just the rows that contain your name and your sex
  2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice)
  3. Plot the results as a line graph with year on the x axis and prop on the y axis

Note that the first argument of ggplot() takes a data frame, which means you can add ggplot() directly to the end of a pipe. However, you will need to switch from %>% to + to finish adding layers to your plot.

babynames %>% 
   filter(name == "John", sex == "M") %>%
   select(year, prop) %>%
   ggplot() +
      geom_line(aes(x = year, y = prop))

2.2.6.6 Recap

Together, select(), filter(), and arrange() let you quickly find information displayed within your data.

The next tutorial will show you how to derive information that is implied by your data, but not displayed within your data set.

In that tutorial, you will continue to use the %>% operator, which is an essential part of programming with the dplyr package.

Pipes help make R expressive, like a spoken language. Spoken languages consist of simple words that you combine into sentences to create sophisticated thoughts.

In the tidyverse, functions are like words: each does one simple task well. You can combine these tasks into pipes with %>% to perform complex, customized procedures.

2.3 Deriving Information with dplyr

Data sets contain more information than they display, and this tutorial will show you how to access that information. You’ll learn to derive new variables and to compute groupwise summary statistics.

2.3.1 Welcome

In this case study, you will identify the most popular American names from 1880 to 2017. While doing this, you will master three more dplyr functions:

  • mutate(), group_by(), and summarize(), which help you use your data to compute new variables and summary statistics

These are some of the most useful R functions for data science, and this tutorial provides everything you need to learn them.

This tutorial uses the core tidyverse packages, including ggplot2, tibble, and dplyr, as well as the babynames package. All of these packages have been pre-installed and pre-loaded for your convenience.

Click the Next Topic button to begin.

2.3.3 summarise()

summarise() takes a data frame and uses it to calculate a new data frame of summary statistics.

2.3.3.1 Syntax

To use summarise(), pass it a data frame and then one or more named arguments. Each named argument should be set to an R expression that generates a single value. summarise() will turn each named argument into a column in the new data frame. The name of each argument will become the column name, and the value returned by the argument will become the column contents.

2.3.3.2 Example

I used summarise() above to calculate the total number of boys named “Garrett”, but let’s expand that code to also calculate

  • max - the maximum number of boys named “Garrett” in a single year
  • mean - the mean number of boys named “Garrett” per year
babynames %>% 
  filter(name == "Garrett", sex == "M") %>% 
  summarise(total = sum(n), max = max(n), mean = mean(n))
## # A tibble: 1 × 3
##    total   max  mean
##    <int> <int> <dbl>
## 1 129759  5840  940.

Don’t let the code above fool you. The first argument of summarise() is always a data frame, but when you use summarise() in a pipe, the first argument is provided by the pipe operator, %>%. Here the first argument will be the data frame that is returned by babynames %>% filter(name == "Garrett", sex == "M").

2.3.3.3 Exercise - summarise()

Use the code chunk below to compute three statistics:

  1. the total number of children who ever had your name
  2. the maximum number of children given your name in a single year
  3. the mean number of children given your name per year

If you cannot think of an R function that would compute each statistic, click the Hint/Solution button.

babynames %>% 
  filter(name == "John", sex == "M") %>% 
  summarise(total = sum(n), max = max(n), mean = mean(n))
## # A tibble: 1 × 3
##     total   max   mean
##     <int> <int>  <dbl>
## 1 5115466 88318 37069.

2.3.3.4 Summary functions

So far our summarise() examples have relied on sum(), max(), and mean(). But you can use any function in summarise() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions and they are common in the field of descriptive statistics. Some of the most useful summary functions include:

  1. Measures of location - mean(x), median(x), quantile(x, 0.25), min(x), and max(x)
  2. Measures of spread - sd(x), var(x), IQR(x), and mad(x)
  3. Measures of position - first(x), nth(x, 2), and last(x)
  4. Counts - n_distinct(x) and n(), which takes no arguments, and returns the size of the current group or data frame.
  5. Counts and proportions of logical values - sum(!is.na(x)), which counts the number of TRUEs returned by a logical test; mean(y == 0), which returns the proportion of TRUEs returned by a logical test.
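Here is a small sketch that uses one function from most of these families in a single summarise() call. The `scores` tibble and the output column names are hypothetical, chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(x = c(2, 4, 4, 10))

out <- scores %>%
  summarise(
    avg      = mean(x),        # location: 5
    med      = median(x),      # location: 4
    spread   = sd(x),          # spread
    first_x  = first(x),       # position: 2
    last_x   = last(x),        # position: 10
    rows     = n(),            # count of rows: 4
    distinct = n_distinct(x),  # count of distinct values: 3
    share_4  = mean(x == 4)    # proportion of values equal to 4: 0.5
  )

out
```

Each function reduces the four values of x to one value, so the result is a single row with one column per named argument.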

Let’s apply some of these summary functions. Click Continue to test your understanding.

2.3.3.5 Khaleesi challenge

“Khaleesi” is a very modern name that appears to be based on the Game of Thrones TV series, which premiered on April 17, 2011. In the chunk below, filter babynames to just the rows where name == "Khaleesi". Then use summarise() and a summary function to return the first value of year in the data set.

babynames %>%
   filter(name == "Khaleesi") %>%
   summarize(year = first(year))
## # A tibble: 1 × 1
##    year
##   <dbl>
## 1  2011

2.3.3.6 Distinct name challenge

In the chunk below, use summarise() and a summary function to return a data frame with two columns:

  • A column named n that displays the total number of rows in babynames
  • A column named distinct that displays the number of distinct names in babynames

Will these numbers be different? Why or why not?

babynames %>% 
   summarize(n = n(), distinct = n_distinct(name))
## # A tibble: 1 × 2
##         n distinct
##     <int>    <int>
## 1 1924665    97310
"Good job! The two numbers are different because most names appear in the data set more than once. They appear once for each year in which they were used."

2.3.3.7 summarise by groups?

How can we apply summarise() to find the most popular names in babynames? You’ve seen how to calculate the total number of children that have your name, which provides one of our measures of popularity, i.e. the total number of children that have a name:

babynames %>% 
  filter(name == "Garrett", sex == "M") %>% 
  summarise(total = sum(n))

However, we had to isolate your name from the rest of your data to calculate this number. You could imagine writing a program that goes through each name one at a time and:

  1. keeps only the rows with that name
  2. applies summarise to the rows

Eventually, the program could combine all of the results back into a single data set. However, you don’t need to write such a program; this is the job of dplyr’s group_by() function.
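To make the comparison concrete, here is a rough sketch of such a program next to its group_by() equivalent, run on a made-up table. The `kids` tibble is hypothetical toy data, and the loop is only one of many ways the manual version could be written.

```r
library(dplyr)

# Hypothetical toy data
kids <- tibble(
  name = c("Ann", "Bo", "Ann", "Bo", "Cy"),
  n    = c(10, 4, 6, 1, 3)
)

# The manual program: handle one name at a time, then combine the pieces
by_hand <- bind_rows(lapply(unique(kids$name), function(nm) {
  kids %>%
    filter(name == nm) %>%                # 1. keep only the rows with that name
    summarise(name = nm, total = sum(n))  # 2. apply summarise() to those rows
}))

# The same totals with group_by()
grouped <- kids %>%
  group_by(name) %>%
  summarise(total = sum(n))

by_hand$total  # 16 5 3
grouped$total  # 16 5 3
```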

2.3.4 group_by()

group_by() takes a data frame and then the names of one or more columns in the data frame. It returns a copy of the data frame that has been “grouped” into sets of rows that share identical combinations of values in the specified columns.

2.3.4.1 group_by() in action

For example, the result below is grouped into rows that have the same combination of year and sex values: boys in 1880 are treated as one group, girls in 1880 as another group and so on.

babynames %>%
  group_by(year, sex)
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows

2.3.4.2 Using group_by()

By itself, group_by() doesn’t do much. It assigns grouping criteria that are stored as metadata alongside the original data set. If your dataset is a tibble, as above, R will tell you that the data is grouped at the top of the tibble display. In all other aspects, the data looks the same.

However, when you apply a dplyr function like summarise() to grouped data, dplyr will execute the function in a groupwise manner. Instead of computing a single summary for the entire data set, dplyr will compute individual summaries for each group and return them as a single data frame. The data frame will contain the summary columns as well as the columns in the grouping criteria, which makes the result decipherable:

babynames %>%
  group_by(year, sex) %>% 
  summarise(total = sum(n))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 3
##     year sex    total
##    <dbl> <chr>  <int>
##  1  1880 F      90993
##  2  1880 M     110491
##  3  1881 F      91953
##  4  1881 M     100743
##  5  1882 F     107847
##  6  1882 M     113686
##  7  1883 F     112319
##  8  1883 M     104627
##  9  1884 F     129020
## 10  1884 M     114442
## # … with 266 more rows

To understand exactly what group_by() is doing, remove the line group_by(year, sex) %>% from the code above and rerun it. How do the results change?

babynames %>%
  summarise(total = sum(n))
## # A tibble: 1 × 1
##       total
##       <int>
## 1 348120517

2.3.4.3 Ungrouping 1

If you apply summarise() to grouped data, summarise() will return data that is grouped in a similar, but not identical fashion. summarise() will remove the last variable in the grouping criteria, which creates a data frame that is grouped at a higher level. For example, this summarise() statement receives a data frame that is grouped by year and sex, but it returns a data frame that is grouped only by year.

babynames %>%
  group_by(year, sex) %>% 
  summarise(total = sum(n))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 3
##     year sex    total
##    <dbl> <chr>  <int>
##  1  1880 F      90993
##  2  1880 M     110491
##  3  1881 F      91953
##  4  1881 M     100743
##  5  1882 F     107847
##  6  1882 M     113686
##  7  1883 F     112319
##  8  1883 M     104627
##  9  1884 F     129020
## 10  1884 M     114442
## # … with 266 more rows
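
The message in the output mentions a .groups argument. As a rough sketch, here is how that argument changes the grouping of the result; the toy tibble and its values are made up for illustration.

```r
library(dplyr)

toy <- tibble(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "M"),
  n    = c(10, 20, 30, 40)
)

# Default behaviour: the last grouping variable (sex) is peeled off
default <- toy %>%
  group_by(year, sex) %>%
  summarise(total = sum(n))
group_vars(default)

# .groups = "drop" returns an ungrouped result and silences the message
dropped <- toy %>%
  group_by(year, sex) %>%
  summarise(total = sum(n), .groups = "drop")
group_vars(dropped)

# .groups = "keep" preserves the full grouping
kept <- toy %>%
  group_by(year, sex) %>%
  summarise(total = sum(n), .groups = "keep")
group_vars(kept)
```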

2.3.4.4 Ungrouping 2

If only one grouping variable is left in the grouping criteria, summarise() will return an ungrouped data set. This feature lets you progressively “unwrap” a grouped data set:

If we add another summarise() to our pipe,

  1. our data set will first be grouped by year and sex.
  2. Then it will be summarised into a data set grouped by year (i.e. the result above).
  3. Then it will be summarised into a final data set that is not grouped.

babynames %>%
  group_by(year, sex) %>% 
  summarise(total = sum(n)) %>% 
  summarise(total = sum(total))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 138 × 2
##     year  total
##    <dbl>  <int>
##  1  1880 201484
##  2  1881 192696
##  3  1882 221533
##  4  1883 216946
##  5  1884 243462
##  6  1885 240854
##  7  1886 255317
##  8  1887 247394
##  9  1888 299473
## 10  1889 288946
## # … with 128 more rows

2.3.4.5 Ungrouping 3

If you wish to manually remove the grouping criteria from a data set, you can do so with ungroup().

babynames %>%
  group_by(year, sex) %>% 
  ungroup()
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows

2.3.4.6 Ungrouping 4

And, you can override the current grouping information with a new call to group_by().

babynames %>%
  group_by(year, sex) %>% 
  group_by(name)
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows
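
You can confirm the override with group_vars(). The sketch below also shows the .add argument of group_by(), which appends to the existing grouping instead of replacing it. The toy tibble is made up for illustration.

```r
library(dplyr)

toy <- tibble(
  year = c(1880, 1880, 1881),
  sex  = c("F", "M", "F"),
  name = c("Mary", "John", "Mary"),
  n    = c(10, 20, 30)
)

# A second group_by() replaces the existing grouping...
replaced <- toy %>% group_by(year, sex) %>% group_by(name)
group_vars(replaced)

# ...unless .add = TRUE, which appends to it instead
added <- toy %>% group_by(year, sex) %>% group_by(name, .add = TRUE)
group_vars(added)
```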

That’s it. Between group_by(), summarise(), and ungroup(), you have a toolkit for taking groupwise summaries of your data at various levels of grouping.

2.3.5 mutate()

2.3.5.1 The total number of children by year

Why might there be a difference between the proportion of children who receive a name over time, and the number of children who receive the name?

An obvious culprit could be the total number of children born per year. If more children are born each year, the number of children who receive a name could grow even if the proportion of children given that name declines.

Test this theory in the chunk below. Use babynames and groupwise summaries to compute the total number of children born each year and then to plot that number vs. year in a line graph.

babynames %>% 
   group_by(year) %>%
   summarize(total = sum(n)) %>%
   ggplot() +
      geom_line(aes(x = year, y = total))

Popularity based on rank

The graph above suggests that our first definition of popularity is confounded with population growth: the most popular names in 2015 likely represent far more children than the most popular names in 1880. The total number of children given a name may still be the best definition of popularity to use, but it will overweight names that have been popular in recent years.

There is also evidence that our definition is confounded with a gender effect: only one of the top ten names was a girl’s name.

If you are concerned about these things, you might prefer to use our second definition of popularity, which would give equal representation to each year and gender:

  1. Ranks - A name is popular if it consistently ranks among the top names from year to year.

To use this definition, we could:

  1. Compute the rank of each name within each year and gender. The most popular name would receive the rank 1 and so on.
  2. Find the median rank for each name, accounting for gender. The names with the lowest median would be the names that “consistently rank among the top names from year to year.”

To do this, we will need to learn one last dplyr function.

2.3.5.2 mutate()

mutate() uses a data frame to compute new variables. It then returns a copy of the data frame that includes the new variables. For example, we can use mutate() to compute a percent variable for babynames. Here percent is just the prop multiplied by 100 and rounded to two decimal places.

babynames %>%
  mutate(percent = round(prop * 100, 2))
## # A tibble: 1,924,665 × 6
##     year sex   name          n   prop percent
##    <dbl> <chr> <chr>     <int>  <dbl>   <dbl>
##  1  1880 F     Mary       7065 0.0724    7.24
##  2  1880 F     Anna       2604 0.0267    2.67
##  3  1880 F     Emma       2003 0.0205    2.05
##  4  1880 F     Elizabeth  1939 0.0199    1.99
##  5  1880 F     Minnie     1746 0.0179    1.79
##  6  1880 F     Margaret   1578 0.0162    1.62
##  7  1880 F     Ida        1472 0.0151    1.51
##  8  1880 F     Alice      1414 0.0145    1.45
##  9  1880 F     Bertha     1320 0.0135    1.35
## 10  1880 F     Sarah      1288 0.0132    1.32
## # … with 1,924,655 more rows

2.3.5.3 Exercise - mutate()

The syntax of mutate() is similar to that of summarise(). mutate() first takes a data frame, and then one or more named arguments that are set equal to R expressions. mutate() turns each named argument into a column. The name of the argument becomes the column name and the result of the R expression becomes the column contents.

Use mutate() in the chunk below to create a births column, the result of dividing n by prop. You can think of births as a sanity check; it uses each row to double check the number of boys or girls that were born each year. If all is well, the numbers will agree across rows (allowing for rounding errors).

babynames %>% 
   mutate(births = n / prop)
## # A tibble: 1,924,665 × 6
##     year sex   name          n   prop births
##    <dbl> <chr> <chr>     <int>  <dbl>  <dbl>
##  1  1880 F     Mary       7065 0.0724 97605.
##  2  1880 F     Anna       2604 0.0267 97605.
##  3  1880 F     Emma       2003 0.0205 97605.
##  4  1880 F     Elizabeth  1939 0.0199 97605.
##  5  1880 F     Minnie     1746 0.0179 97605.
##  6  1880 F     Margaret   1578 0.0162 97605.
##  7  1880 F     Ida        1472 0.0151 97605.
##  8  1880 F     Alice      1414 0.0145 97605.
##  9  1880 F     Bertha     1320 0.0135 97605.
## 10  1880 F     Sarah      1288 0.0132 97605.
## # … with 1,924,655 more rows

2.3.5.4 Vectorized functions

Like summarise(), mutate() works in combination with a specific type of function. summarise() expects summary functions, which take vectors of input and return single values. mutate() expects vectorized functions, which take vectors of input and return vectors of values.

In other words, summary functions like min() and max() won’t work well with mutate(). You can see why if you take a moment to think about what mutate() does: mutate() adds a new column to the original data set. In R, every column in a dataset must be the same length, so mutate() must supply as many values for the new column as there are in the existing columns.

If you give mutate() an expression that returns a single value, it will follow R’s recycling rules and repeat that value as many times as needed to fill the column. This can make sense in some cases, but the reverse is never true: you cannot give summarise() a vectorized function; summarise() needs its input to return a single value.
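
Here is a small sketch of that recycling behaviour, next to a vectorized expression for comparison. The toy tibble and the column names largest and share are made up for illustration.

```r
library(dplyr)

toy <- tibble(n = c(5, 10, 35))

result <- toy %>%
  mutate(
    largest = max(n),     # summary function: one value, recycled to every row
    share   = n / sum(n)  # vectorized expression: one value per row
  )
result
```

The largest column repeats a single number in every row, while share holds a different value for each row.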

What are some of R’s vectorized functions? Click Continue to find out.

2.3.5.5 The most useful vectorized functions

Some of the most useful vectorised functions in R to use with mutate() include:

  1. Arithmetic operators - +, -, *, /, ^. These are all vectorised, using R’s so-called “recycling rules”. If one vector of input is shorter than the other, it will automatically be repeated multiple times to create a vector of the same length.
  2. Modular arithmetic: %/% (integer division) and %% (remainder)
  3. Logical comparisons - <, <=, >, >=, !=, ==
  4. Logs - log(x), log2(x), log10(x)
  5. Offsets - lead(x), lag(x)
  6. Cumulative aggregates - cumsum(x), cumprod(x), cummin(x), cummax(x), cummean(x)
  7. Ranking - min_rank(x), row_number(x), dense_rank(x), percent_rank(x), cume_dist(x), ntile(x)

For ranking, I recommend that you use min_rank(), which gives the smallest values the top ranks. To rank in descending order, use the familiar desc() function, e.g.

min_rank(c(50, 100, 1000))
## [1] 1 2 3
min_rank(desc(c(50, 100, 1000)))
## [1] 3 2 1
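
To get a feel for a few of the other functions in the list, you can try them on a short vector. The sketch below uses a made-up vector with one tie, which also shows how three of the ranking functions differ.

```r
library(dplyr)

x <- c(10, 20, 20, 40)

x %/% 15          # integer division
x %% 15           # remainder
lag(x)            # previous value; the first element has none, so it is NA
lead(x)           # next value; the last element is NA
cumsum(x)         # running total

# Ranking functions differ in how they treat the tie between the two 20s
min_rank(x)       # ties share the lowest rank, and the next rank is skipped
dense_rank(x)     # ties share a rank, and no rank is skipped
row_number(x)     # ties are broken by position
```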

2.3.5.6 Exercise - Ranks

Let’s practice by ranking the entire dataset based on prop. In the chunk below, use mutate() and min_rank() to rank each row based on its prop value, with the highest values receiving the top ranks.

babynames %>%
   mutate(rank = min_rank(desc(prop)))
## # A tibble: 1,924,665 × 6
##     year sex   name          n   prop  rank
##    <dbl> <chr> <chr>     <int>  <dbl> <int>
##  1  1880 F     Mary       7065 0.0724    14
##  2  1880 F     Anna       2604 0.0267   709
##  3  1880 F     Emma       2003 0.0205  1131
##  4  1880 F     Elizabeth  1939 0.0199  1192
##  5  1880 F     Minnie     1746 0.0179  1427
##  6  1880 F     Margaret   1578 0.0162  1683
##  7  1880 F     Ida        1472 0.0151  1897
##  8  1880 F     Alice      1414 0.0145  2039
##  9  1880 F     Bertha     1320 0.0135  2279
## 10  1880 F     Sarah      1288 0.0132  2387
## # … with 1,924,655 more rows

2.3.5.7 Rankings by group

In the previous exercise, we assigned rankings across the entire data set. For example, with the exception of ties, there was only one 1 in the entire data set, only one 2, and so on. To calculate a popularity score across years, you will need to do something different: you will need to assign rankings within groups of year and sex. Now there will be one 1 in each group of year and sex.

To rank within groups, combine mutate() with group_by(). Like dplyr’s other functions, mutate() will treat grouped data in a group-wise fashion.

Add group_by() to our code from above, to calculate ranking within year and sex combinations. Do you notice the numbers change?

babynames %>%
  group_by(year, sex) %>%
  mutate(rank = min_rank(desc(prop)))
## # A tibble: 1,924,665 × 6
##     year sex   name          n   prop  rank
##    <dbl> <chr> <chr>     <int>  <dbl> <int>
##  1  1880 F     Mary       7065 0.0724     1
##  2  1880 F     Anna       2604 0.0267     2
##  3  1880 F     Emma       2003 0.0205     3
##  4  1880 F     Elizabeth  1939 0.0199     4
##  5  1880 F     Minnie     1746 0.0179     5
##  6  1880 F     Margaret   1578 0.0162     6
##  7  1880 F     Ida        1472 0.0151     7
##  8  1880 F     Alice      1414 0.0145     8
##  9  1880 F     Bertha     1320 0.0135     9
## 10  1880 F     Sarah      1288 0.0132    10
## # … with 1,924,655 more rows
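
With groupwise ranks in hand, the second step of the rank-based definition (the median rank of each name) is one more groupwise summary. Here is a minimal sketch on made-up data; the values and the column name score are assumptions for illustration, not results from babynames.

```r
library(dplyr)

# Toy data (hypothetical values): two years, one sex, three names
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = "F",
  name = c("Mary", "Anna", "Emma", "Mary", "Anna", "Emma"),
  prop = c(0.07, 0.03, 0.02, 0.06, 0.02, 0.03)
)

median_ranks <- toy %>%
  group_by(year, sex) %>%
  mutate(rank = min_rank(desc(prop))) %>%                # rank within year and sex
  group_by(name, sex) %>%
  summarise(score = median(rank), .groups = "drop") %>%  # median rank per name
  arrange(score)
median_ranks
```

Names with the lowest score are the ones that consistently rank near the top.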

2.3.5.9 Recap

In this primer, you learned three functions for isolating data within a table:

  • select()
  • filter()
  • arrange()

You also learned three functions for deriving new data from a table:

  • summarise()
  • group_by()
  • mutate()

Together these six functions create a grammar of data manipulation, a system of verbs that you can use to manipulate data in a sophisticated, step-by-step way. These verbs target the everyday tasks of data analysis. No matter which types of data you work with, you will discover that:

  1. Data sets often contain more information than you need
  2. Data sets imply more information than they display

The six dplyr functions help you work with these realities by isolating and revealing the information contained in your data. In fact, dplyr provides more than six functions for this grammar: dplyr comes with several functions that are variations on the themes of select(), filter(), summarise(), and mutate(). Each follows the same pipeable syntax that is used throughout dplyr. If you are interested, you can learn more about these peripheral functions in the dplyr cheatsheet.

2.3.6 Challenges

Apply your knowledge of dplyr to do the following two challenges.

2.3.6.1 Number Ones Challenge - boys

How many distinct boys’ names achieved a rank of Number 1 in any year?

top_male <- babynames %>% 
  group_by(year, sex) %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  filter(rank == 1, sex == "M")
unique(top_male$name)
## [1] "John"    "Robert"  "James"   "Michael" "David"   "Jacob"   "Noah"   
## [8] "Liam"

babynames %>% 
  group_by(year, sex) %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  filter(rank == 1, sex == "M") %>% 
  ungroup() %>% 
  summarise(distinct = n_distinct(name))
## # A tibble: 1 × 1
##   distinct
##      <int>
## 1        8

babynames %>% 
  group_by(year, sex) %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  filter(rank == 1, sex == "M") %>% 
  ungroup() %>% 
  group_by(name) %>%
  summarise(distinct = n_distinct(year)) %>%
  arrange(desc(distinct))
## # A tibble: 8 × 2
##   name    distinct
##   <chr>      <int>
## 1 John          44
## 2 Michael       44
## 3 Robert        17
## 4 Jacob         14
## 5 James         13
## 6 Noah           4
## 7 David          1
## 8 Liam           1

2.3.6.2 Number Ones Challenge - girls

How many distinct girls’ names achieved a rank of Number 1 in any year?

babynames %>% 
  group_by(year, sex) %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  filter(rank == 1, sex == "F") %>% 
  ungroup() %>% 
  summarise(distinct = n_distinct(name))
## # A tibble: 1 × 1
##   distinct
##      <int>
## 1       10

babynames %>% 
  group_by(year, sex) %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  filter(rank == 1, sex == "F") %>% 
  ungroup() %>% 
  group_by(name) %>%
  summarise(distinct = n_distinct(year)) %>%
  arrange(desc(distinct))
## # A tibble: 10 × 2
##    name     distinct
##    <chr>       <int>
##  1 Mary           76
##  2 Jennifer       15
##  3 Emily          12
##  4 Jessica         9
##  5 Lisa            8
##  6 Linda           6
##  7 Emma            5
##  8 Sophia          3
##  9 Ashley          2
## 10 Isabella        2
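
There is more than one way to solve these challenges. As a sketch of one alternative, slice_max() can pick the top row of each group and count() can tally how often each name appears. The toy data and the column name years_at_one are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical values): two names over three years
toy <- tibble(
  year = c(2000, 2000, 2001, 2001, 2002, 2002),
  sex  = "M",
  name = c("Jacob", "Noah", "Jacob", "Noah", "Jacob", "Noah"),
  n    = c(30, 20, 25, 28, 10, 40)
)

number_one_years <- toy %>%
  group_by(year, sex) %>%
  slice_max(n, n = 1) %>%   # keep the row(s) with the largest n in each group
  ungroup() %>%
  # tally rows per name; the name = argument sets the output column's name
  count(name, sort = TRUE, name = "years_at_one")
number_one_years
```

Because each year and sex contributes one top row (barring ties), counting rows per name gives the number of years that name spent at Number 1.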

2.3.6.3 Number Ones Challenge - Plot

number_ones is a vector of every boy’s name to achieve a rank of one.

number_ones
## [1] "John"    "Robert"  "James"   "Michael" "David"   "Jacob"   "Noah"   
## [8] "Liam"

Use number_ones with babynames to recreate the plot below, which shows the popularity over time for every name in number_ones.

image
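
If you get stuck, here is one possible approach, sketched on stand-in data. It assumes that “popularity” means prop and that only boys should be plotted; both are guesses about the target plot. The toy tibble and the shortened number_ones vector are made up so that the sketch runs on its own.

```r
library(dplyr)
library(ggplot2)

# Stand-ins (hypothetical values) for babynames and number_ones
toy <- tibble(
  year = rep(c(1880, 1881, 1882), times = 3),
  sex  = "M",
  name = rep(c("John", "Liam", "Zed"), each = 3),
  prop = c(0.08, 0.07, 0.06, 0.001, 0.002, 0.003, 0.0001, 0.0001, 0.0001)
)
number_ones <- c("John", "Liam")

# Keep only the boys whose names appear in number_ones
plot_data <- toy %>%
  filter(name %in% number_ones, sex == "M")

# One line per name, showing prop over time
p <- ggplot(plot_data) +
  geom_line(aes(x = year, y = prop, color = name))
p
```

With the real data, you would replace toy with babynames and use the full number_ones vector.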

2.3.6.4 Name Diversity Challenge - number of unique names

Which gender uses more names?

In the chunk below, calculate and then plot the number of distinct names used each year for boys and girls. Place year on the x axis, the number of distinct names on the y axis and color the lines by sex.

babynames %>%
  group_by(year, sex) %>%
  summarize(distinct_names = n_distinct(name)) %>%
  ggplot() +
    geom_line(aes(x = year, y = distinct_names, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

What about the code below? Does it produce the same result?

babynames %>%
  group_by(year, sex) %>%
  mutate(distinct_names = n_distinct(name)) %>%
  ggplot() +
    geom_line(aes(x = year, y = distinct_names, color = sex))
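
One way to investigate is to compare the data that each version hands to ggplot(). The sketch below does this on a small made-up tibble; the values are hypothetical.

```r
library(dplyr)

toy <- tibble(
  year = c(1880, 1880, 1880, 1881),
  sex  = c("F", "F", "M", "F"),
  name = c("Mary", "Anna", "John", "Mary")
)

summarised <- toy %>%
  group_by(year, sex) %>%
  summarise(distinct_names = n_distinct(name), .groups = "drop")

mutated <- toy %>%
  group_by(year, sex) %>%
  mutate(distinct_names = n_distinct(name))

nrow(summarised)  # one row per year/sex group
nrow(mutated)     # one row per original row, with the group value repeated
```

The plotted lines may look alike because the repeated points coincide, but the mutate() version passes every original row to the plot.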

2.3.6.5 Name Diversity Challenge - number of boys and girls

Let’s make sure that we’re not confounding our search with the total number of boys and girls born each year. With the chunk below, calculate and then plot over time the total number of boys and girls by year. Is the relative number of boys and girls constant?

babynames %>%
  group_by(year, sex) %>%
  summarize(total = sum(n)) %>%
  ggplot() + 
    geom_line(aes(x = year, y = total, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

Name Diversity Challenge - children per name

Hmm. Sometimes there are more girls and sometimes more boys. In addition, the entire population has grown over time. Let’s account for this with a new metric: the average number of children per name.

If girls have a smaller number of children per name, that would imply that they use more names overall (and vice versa).

In the chunk below, calculate and plot the average number of children per name by year and sex over time. How do you interpret the results?

babynames %>%
  group_by(year, sex) %>%
  summarize(average = mean(n)) %>%
  ggplot() + 
    geom_line(aes(x = year, y = average, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
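
Why does mean(n) measure children per name? As long as each name appears once per year and sex, the mean of n within a group equals the group’s total number of children divided by its number of distinct names. Here is a quick check on made-up data; the values and the column name by_hand are hypothetical.

```r
library(dplyr)

toy <- tibble(
  year = 1880,
  sex  = c("F", "F", "M", "M", "M"),
  name = c("Mary", "Anna", "John", "William", "James"),
  n    = c(70, 30, 90, 60, 30)
)

per_name <- toy %>%
  group_by(year, sex) %>%
  summarise(
    average = mean(n),                     # the metric used above
    by_hand = sum(n) / n_distinct(name),   # total children / number of names
    .groups = "drop"
  )
per_name
```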

2.3.6.6 Where to from here

Congratulations! You can use dplyr’s grammar of data manipulation to access any data associated with a table—even if that data is not currently displayed by the table.

In other words, you now know how to look at data in R, as well as how to access specific values, calculate summary statistics, and compute new variables. When you combine this with the visualization skills that you learned in Visualization Basics, you have everything that you need to begin exploring data in R.

The next tutorial will teach you the last of three basic skills for working with R:

  1. How to visualize data
  2. How to work with data
  3. How to program with R code

3 Visualize Data

Learn how to use ggplot2 to make any type of plot with your data. Then learn the best ways to visualize patterns within values and relationships between variables.

If you’re ready to begin, go to the first tutorial. There is no need to install or download anything. Each tutorial has everything you need to write and run R code, right in the tutorial.

3.1 Exploratory Data Analysis

Start here to learn how to explore your data with visualizations, using a strategy known as Exploratory Data Analysis (EDA).

3.1.1 Welcome

This tutorial will show you how to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. In the tutorial you will:

  • Learn a strategy for exploring data
  • Practice finding patterns in data
  • Get tips about how to use different types of plots to explore data

The tutorial is excerpted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

3.1.2 Exploratory Data Analysis

3.1.2.1 What is EDA?

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

  1. Generate questions about your data

  2. Search for answers by visualising, transforming, and/or modeling your data

  3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.

3.1.2.2 The EDA mindset

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA, you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on lines of inquiry that reveal insights worth writing up and communicating to others.

3.1.2.3 Questions

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

3.1.2.4 Quantity vs Quality

EDA is, fundamentally, a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will highlight a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

3.1.2.5 Two useful questions

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?

The rest of this tutorial will look at these two questions. To make the discussion easier, let’s define some terms…

3.1.2.5.1 Definitions

  • A variable is a quantity, quality, or property that you can measure.
  • A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
  • An observation or case is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a case or data point.
  • Tabular data is a table of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own cell, each variable in its own column, and each observation in its own row.
  • So far, all of the data that you’ve seen has been tidy. In real life, most data isn’t tidy, so we’ll come back to these ideas again in Data Wrangling.

3.1.2.6 Review 1 - Discovery or Confirmation?

You can think of science as a process with two steps: discovery and confirmation. Scientists first observe the world to discover a hypothesis to test. Then, they devise a test to confirm the hypothesis against new data. If a hypothesis survives many tests, scientists begin to trust that it is a reliable explanation of the data.

The separation between discovery and confirmation is especially important for data scientists. It is easy for patterns to appear in data by coincidence. As a result, data scientists first look for patterns, and then try to confirm that the patterns exist in the real world. Sometimes this confirmation requires computing the probability that the pattern is due to random chance, a task that often involves collecting new data.

Is EDA a tool for discovery or confirmation?

  • Discovery ✓
  • Confirmation ✗
Correct!

EDA is a tool for discovery; in fact, EDA is one of the most fruitful tools for discovery in science. We'll focus on discovery throughout this primer, but remember that you should test any pattern that you discover before you rely on it.

3.1.2.7 Review 2 - Quality or Quantity?

When you begin to explore data, is it better to formulate one or two high-quality questions to ask, or many, many questions to explore?

  • One or two high-quality questions ✗
  • Many, many questions ✓
Correct!

Each question you ask creates a new opportunity to discover something surprising. You can lead yourself to high-value questions by iterating on questions that reveal unexpected results.

3.1.2.8 Review 3 - Definitions

iris is a famous toy data set that comes with R. The data set describes 150 iris flowers. Each row in iris displays a flower’s sepal and petal dimensions. You can use these measurements to deduce the flower’s species, which is also displayed in iris.

iris
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

3.1.2.9 Variables, values, and observations

Which of these is a variable in the iris dataset?

  • Sepal.Length ✓
  • flowers ✗
  • setosa ✗
  • 5.1 ✗
Correct!

Which of these is a value in the iris dataset?

  • Species ✗
  • 3.5 ✓
  • Petal.Length ✗
  • flowers ✗
Correct!

Which of these is an observation in the iris dataset?

  • The collection of measurements, 5.1, 3.5, 1.4, 0.2, and setosa, which describe the first flower in the data set. ✓
  • The collection of names, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. ✗
  • The collection of measurements, 5.1, 4.9, 4.7, and so on, which are all of the values in the Sepal.Length column. ✗
Correct!

These measurements were all collected under similar circumstances: on the same flower, presumably at the same time. If a relationship exists between the variables that these values describe, we would expect the relationship to also exist between these values.
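
If it helps, you can pull each of these pieces out of iris with base R. This is only a sketch for orientation; it uses the data frame that ships with R, so no packages are needed.

```r
# iris ships with base R, so no packages are needed

names(iris)             # the variables
iris$Sepal.Width[1]     # a single value
iris[1, ]               # one observation: every measurement for the first flower
iris$Sepal.Length[1:3]  # several values of one variable (not an observation)
```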

3.1.3 Variation

3.1.3.1 What is variation?

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice—and precisely enough—you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different objects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).

Every variable has its own pattern of variation, which can reveal useful information. The best way to understand that pattern is to visualise the distribution of the variable’s values. How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous.

3.1.3.2 Categorical variables

A variable is categorical if it can take only one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. You can visualize the distribution of a categorical variable with a bar chart, like the one below.

chart

Don’t worry if you cannot make or interpret a bar chart. We’ll survey several types of charts in this tutorial, as we create a strategy for EDA. You’ll learn how to build each type of chart in the tutorials that follow.

3.1.3.3 Continuous variables

A variable is continuous if it can take any of an infinite set of smooth, ordered values. Here, smooth means that if you order the values on a line, an infinite number of values would exist between any two points on the line. For example, an infinite number of values exists between 0 and 1, e.g. 0.9, 0.99, 0.999, and so on.

Numbers and date-times are two examples of continuous variables. You can visualize the distribution of a continuous variable with a histogram, like the one below:

image

Frequencies

In both bar charts and histograms, tall bars show the common values of a variable, i.e. the values that appear frequently. Shorter bars show less-common values, i.e. values that appear infrequently. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:

  • Which values are the most common? Why?
  • Which values are rare? Why? Does that match your expectations?
  • Can you see any unusual patterns? What might explain them?
  • Are there any outliers, which are points that don’t fit the pattern or fall far away from the rest of the data? Are they the result of data entry errors or something else?

Many of the questions above will prompt you to explore a relationship between variables, to see if the values of one variable can explain the values of another variable. We’ll get to that shortly.

3.1.3.4 Review 4 - Frequencies

The bar chart below visualises the distribution of the class variable in the mpg data set, which comes in the ggplot2 package. The heights of the bars reveal how many cars in the data set come from each class.

image The distribution of class in mpg

What is the most common type of car in the mpg data set?

  • 2seater ✗
  • compact ✗
  • midsize ✗
  • minivan ✗
  • pickup ✗
  • subcompact ✗
  • suv ✓

What is the least common type of car in the mpg data set?

  • 2seater ✓
  • compact ✗
  • midsize ✗
  • minivan ✗
  • pickup ✗
  • subcompact ✗
  • suv ✗
Correct!

Does the distribution of cars in the mpg dataset seem to reflect the distribution of cars that you see on the road? Would your answer shape how you use this data?

  • I have my answers ✓
Correct!
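
You can also read frequencies as numbers instead of bar heights. The sketch below tabulates class with dplyr’s count() and then draws a bar chart with geom_bar(). The object name class_counts is made up for illustration, and later tutorials cover bar charts in detail.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

class_counts <- mpg %>%
  count(class, sort = TRUE)   # one row per class, most common first
class_counts

# The same information as a bar chart
ggplot(mpg) +
  geom_bar(aes(x = class))
```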

3.1.3.5 Clusters

For continuous variables, clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:

  • How are the observations within each cluster similar to each other?
  • How are the observations in separate clusters different from each other?
  • How can you explain or describe the clusters?
  • Why might the appearance of clusters be misleading?

3.1.3.6 Review 5 - Clusters

The histogram below shows the distribution of the eruptions variable in the faithful data set, which comes with R. eruptions shows the lengths (in minutes) of 272 eruptions of the Old Faithful geyser in Yellowstone National Park.

To interpret the histogram, look first at the x axis, which displays the lengths of eruptions recorded in the data. The range of the x axis shows that the shortest eruptions lasted for about one minute and the longest for about five minutes.

To see how many eruptions lasted for a specific length of time, find the length of time on the x axis and then look at the height of the bar above the length of time. For example, according to the histogram, 30 eruptions lasted for about two minutes, but only three lasted for about three minutes (the height of the bar above two is 30, the height of the bar above three is three).

image

Do the eruption lengths cluster into groups? How many?

  • No. There are no clusters. ✗
  • Yes. Two clusters. ✓
  • Yes. Three clusters. ✗
  • Yes. Four clusters. ✗
Correct!

Eruption lengths appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but few eruptions in between.
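
If you would like to draw a similar histogram yourself, here is a rough sketch. The binwidth of 0.25 minutes is an assumption, so the bar heights may not match the figure above exactly, and p_eruptions is just an illustrative name.

```r
library(ggplot2)

# faithful ships with R; binwidth = 0.25 is an assumed value,
# not necessarily the one used for the figure in the tutorial
p_eruptions <- ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = eruptions), binwidth = 0.25)

p_eruptions
```

Try a few other binwidths to check that the two clusters do not depend on one particular choice of bins.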

3.1.4 Covariation

3.1.4.1 What is covariation?

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on whether your variables are categorical or continuous.

3.1.4.2 Two categorical variables

You can plot the relationship between two categorical variables with a heatmap or with geom_count():

image

image

Again, don’t be concerned if you do not know how to make these graphs. For now, let’s focus on the strategy of how to use visualizations in EDA. You’ll learn how to make different types of plots in the tutorials that follow.

3.1.4.3 One continuous and one categorical variable

You can plot the relationship between one continuous and one categorical variable with a boxplot:

image

Two continuous variables

You can plot the relationship between two continuous variables with a scatterplot:

image

3.1.4.4 Patterns

Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

  • Could this pattern be due to coincidence (i.e. random chance)?
  • How can you describe the relationship implied by the pattern?
  • How strong is the relationship implied by the pattern?
  • What other variables might affect the relationship?
  • Does the relationship change if you look at individual subgroups of the data?

Remember that clusters and outliers are also a type of pattern. Two dimensional plots can reveal clusters and outliers that would not be visible in a one dimensional plot. If you spot either, ask yourself what they imply.

3.1.4.5 Review 6 - Patterns

The scatterplot below shows the relationship between the length of an eruption of Old Faithful and the wait time before the eruption (i.e. the amount of time that passed between it and the previous eruption).

image

Does the scatterplot above reveal a pattern that helps to explain the variation in lengths of Old Faithful eruptions?

  • No. There is no pattern. ✗
  • Yes. Long eruptions are associated with a short wait before the eruption ✗
  • Yes. Long eruptions are associated with a long wait before the eruption ✓
Correct!

The data seems to suggest that a long build up before an eruption is associated with a long eruption. The plot also shows the two clusters that we saw before: there are long eruptions with a long build up and short eruptions with a short build up.
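
One way to put a number on a pattern like this is a correlation coefficient. The sketch below assumes that the scatterplot was drawn from the waiting and eruptions columns of faithful (the tutorial does not show its plotting code); eruption_cor is an illustrative name.

```r
library(ggplot2)

# Scatterplot of eruption length against waiting time
ggplot(data = faithful) +
  geom_point(mapping = aes(x = waiting, y = eruptions))

# Pearson correlation: values near 1 indicate a strong positive association
eruption_cor <- cor(faithful$waiting, faithful$eruptions)
eruption_cor
```

A single correlation summarises the overall trend, but it hides the two clusters, which is one reason to look at the plot first.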

3.1.4.6 Uncertainty

Patterns provide a useful tool for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. When two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), you can use the value of one variable to control the value of the second.

3.1.4.7 Recap

You’ve learned a lot in this tutorial. Here’s what you should keep with you:

  • EDA is an iterative cycle built around asking and refining questions.
  • These two questions are always useful:
    1. What type of variation occurs within my variables?
    2. What type of covariation occurs between my variables?
  • Remember the definitions of variables, values, observations, variation, covariation, categorical, and continuous. You’ll see them again. Frequently.

Throughout the tutorial, you also encountered several recommendations for plots that visualize variation and covariation for categorical and continuous variables. Plots are a bit like questions in EDA: you should make many quickly and try anything that strikes your fancy. You can refine your plots later to share with others. A lot of refinement will occur naturally as you iterate during EDA.

The suggestions below can serve as a starting point for visualizing data. In the tutorials that follow, you will learn how to make each type of plot, as well as how to use best practices and advanced skills when visualizing data.

image

3.2 Bar Charts

Learn to make and customize bar charts, a device for visualizing the distribution of categorical variables. Here, you will also learn to use ggplot2 position adjustments and facetting.

3.2.1 Welcome

This tutorial will show you how to make and enhance bar charts with the ggplot2 package. You will learn how to:

  • make and interpret bar charts
  • customize bar charts with aesthetics and parameters
  • use position adjustments
  • use facets to create subplots

The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

The tutorial uses the ggplot2 and dplyr packages, which have been pre-loaded for your convenience.

3.2.2 Bar Charts

3.2.2.1 How to make a bar chart

To make a bar chart with ggplot2, add geom_bar() to the ggplot2 template. For example, the code below plots a bar chart of the cut variable in the diamonds dataset, which comes with ggplot2.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

3.2.2.2 The y axis

You should not supply a y aesthetic when you use geom_bar(); ggplot2 will count how many times each x value appears in the data, and then display the counts on the y axis. So, for example, the plot above shows that over 20,000 diamonds in the data set had a value of Ideal.

You can compute this information manually with the count() function from the dplyr package.

diamonds %>% 
  count(cut)
## # A tibble: 5 × 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551

3.2.2.3 geom_col()

Sometimes, you may want to map the heights of the bars not to counts, but to a variable in the data set. To do this, use geom_col(), which is short for column.

ggplot(data = pressure) +
  geom_col(mapping = aes(x = temperature, y = pressure))

3.2.2.4 geom_col() data

When you use geom_col(), your x and y values should have a one to one relationship, as they do in the pressure data set (i.e. each value of temperature is paired with a single value of pressure).

pressure
## # A tibble: 19 × 2
##    temperature pressure
##          <dbl>    <dbl>
##  1           0   0.0002
##  2          20   0.0012
##  3          40   0.006 
##  4          60   0.03  
##  5          80   0.09  
##  6         100   0.27  
##  7         120   0.75  
##  8         140   1.85  
##  9         160   4.2   
## 10         180   8.8   
## 11         200  17.3   
## 12         220  32.1   
## 13         240  57     
## 14         260  96     
## 15         280 157     
## 16         300 247     
## 17         320 376     
## 18         340 558     
## 19         360 806
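
Before you reach for geom_col(), you can check the one to one relationship yourself. This is a small sketch that uses dplyr; repeated_x is an illustrative name, not something defined by the tutorial.

```r
library(dplyr)

# Count how many rows share each temperature value. geom_col() expects
# one row per x value, so no count should exceed 1.
repeated_x <- pressure %>%
  count(temperature) %>%
  filter(n > 1)

nrow(repeated_x)
```

An empty result means that every x value appears once, so each bar will show a single y value rather than a stack of several.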

3.2.2.5 Exercise 1 - Make a bar chart

Use the code chunk below to plot the distribution of the color variable in the diamonds data set, which comes in the ggplot2 package.

ggplot(data = diamonds) +
  geom_bar(aes(x = color))

3.2.2.6 Exercise 2 - Interpretation

image

Bar charts

What is the most common type of cut in the diamonds dataset?

  • Fair ✗
  • Good ✗
  • Very Good ✗
  • Premium ✗
  • Ideal ✓
Correct!

How many diamonds in the dataset had a Good cut?

  • ~2000 ✗
  • ~5000 ✓
  • ~7000 ✗
  • ~20000 ✗
Correct!

3.2.2.7 Exercise 3 - What went wrong?

Diagnose the error below and then fix the code chunk to make a plot.

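# Broken: geom_bar() computes bar heights by counting rows, so it rejects a y aesthetic alongside x
ggplot(data = pressure) +
  geom_bar(mapping = aes(x = temperature, y = pressure))

# Fixed: geom_col() takes the bar heights from a variable in the data
ggplot(data = pressure) +
  geom_col(mapping = aes(x = temperature, y = pressure))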

3.2.2.8 Exercise 4 - count() and geom_col()

Recreate the bar graph of color from exercise one, but this time first use count() to manually compute the heights of the bars. Then use geom_col() to plot the results as a bar graph. Does your graph look the same as in exercise one?

diamonds %>% 
  count(color) %>%
  ggplot() +
  geom_col(aes(x = color, y = n))

The following code creates a table of the counts instead of a plot.

diamonds %>% 
  count(color)
## # A tibble: 7 × 2
##   color     n
##   <ord> <int>
## 1 D      6775
## 2 E      9797
## 3 F      9542
## 4 G     11292
## 5 H      8304
## 6 I      5422
## 7 J      2808

3.2.3 Aesthetics

3.2.3.1 Aesthetics for bars

geom_bar() and geom_col() can use several aesthetics:

  • alpha
  • color
  • fill
  • linetype
  • size

One of these, color, creates the most surprising results. Predict what the code below will return and then run it.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

3.2.3.2 fill

The color aesthetic controls the outline of each bar in your bar plot, which may not be what you want. To color the interior of each bar, use the fill aesthetic:

image

image

Use the code chunk below to experiment with fill, along with other geom_bar() aesthetics, like alpha, linetype, and size.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut), alpha = 0.5)  # a constant alpha is set outside aes()

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 0.8)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 0.5)

Notice that width is a parameter, not an aesthetic mapping. Hence, you should set width outside of the aes() function.
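
The same distinction applies when you want every bar to share one appearance. The sketch below contrasts mapping an aesthetic to a variable with setting it to a fixed value; the color "steelblue" and the names p_mapped and p_set are arbitrary choices for illustration.

```r
library(ggplot2)

# Mapping: fill varies with the data, so it goes inside aes()
p_mapped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

# Setting: every bar gets the same fill and transparency, so the values
# go outside aes(), next to parameters like width
p_set <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut), fill = "steelblue", alpha = 0.5, width = 0.8)

p_mapped
p_set
```

Only the mapped version draws a legend, because only there does the appearance carry information about the data.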

3.2.3.3 Exercise 5 - aesthetics

Create a colored bar chart of the class variable from the mpg data set, which comes with ggplot2. Map the interior color of each bar to class.

ggplot(data = mpg) + 
  geom_bar(aes(class, fill = class))

3.2.4 Position adjustments

3.2.4.1 Positions

If you map fill to a new variable, geom_bar() will display a stacked bar chart:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

This plot displays 40 different combinations of cut and clarity, each displayed by its own rectangle. geom_bar() lays out the rectangles by stacking rectangles that have the same cut value on top of one another. You can change this behavior with a position adjustment.

3.2.4.2 Position = “dodge”

To place rectangles that have the same cut value beside each other, set position = “dodge”.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

This plot shows the same rectangles as the previous chart; however, it lays out rectangles that have the same cut value beside each other.

3.2.4.3 Position = “stack”

To create the familiar stacked bar chart, set position = “stack” (which is the default for geom_bar()).

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")

3.2.4.4 Position = “fill”

To expand each bar to take up the entire y axis, set position = “fill”. ggplot2 will stack the rectangles and then scale them within each bar.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

This makes it easy to compare proportions. For example, you can scan across the bars to see how the proportion of IF diamonds changes from cut to cut.
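
If you want the numbers behind the filled bars, you can compute the proportions by hand. This is a sketch with dplyr; clarity_props and prop are illustrative names.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# For each cut, compute the share of diamonds that fall in each clarity level
clarity_props <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# For example, the share of IF diamonds within each cut
clarity_props %>%
  filter(clarity == "IF")
```

Within each cut the proportions add up to one, which is why every bar in the position = "fill" plot has the same height.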

What is a position adjustment?

Every geom function in ggplot2 takes a position argument that is preset to a reasonable default. You can use position to determine how a geom should adjust objects that would otherwise overlap with each other.

For example, in our plot, each value of cut is associated with eight rectangles: one each for I1, SI2, SI1, VS2, VS1, VVS2, VVS1, and IF. Each of these eight rectangles deserves to go in the same place: directly above the value of cut that it is associated with, with the bottom of the rectangle placed at count = 0. But if we drew the plot like that, the rectangles would overlap each other.

Here’s what that would look like if you could peek around the side of the graph.

image

3.2.4.5 Position = “identity”

...and here’s what that would look like if you could see the graph from the front. You can make this plot by setting position = “identity”.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity")

Position adjustments tell ggplot2 how to re-distribute objects when they overlap. position = “identity” is the “adjustment” that lets objects overlap each other. It is a bad choice for bar graphs because the result looks like a stacked bar chart, even though it is not.
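
If you do want to see the overlap for yourself, one option is to make the bars partly transparent. This is only a sketch: the alpha value of 0.2 is an assumption, and p_overlap is an illustrative name.

```r
library(ggplot2)

# Draw every rectangle from count = 0 and let them overlap;
# a low alpha makes the hidden bars show through
p_overlap <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "identity",
    alpha = 0.2  # assumed value; any alpha well below 1 reveals the overlap
  )

p_overlap
```

The result is still hard to read, which supports the advice above to prefer "stack", "dodge", or "fill" for bar charts.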

3.2.4.6 Exercise 6 - Positions

Use the code chunk to recreate the plot you see below. Remember: color is the name of a variable in diamonds (not to be confused with an aesthetic).

image

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color, fill = clarity), position = "fill", width = 1)

3.2.4.7 Exercise 7 - Positions

Use the code chunk to recreate the plot you see below. Remember: color is the name of a variable in diamonds (not to be confused with an aesthetic).

image

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color, fill = cut), position = "dodge")

3.2.4.8 Exercise 8 - position = “identity”

image

Why is position = “identity” a bad idea?

Suppose the graph above uses position = “stack”. About how many diamonds have an ideal cut and a G color?

  • 5000 ✗
  • 3000 ✗
  • 1800 ✓
  • The graph doesn’t contain enough information to make an estimate. ✗
Correct!

In a stacked bar chart, you can calculate the number of observations in each bar by subtracting the y value at the bottom of the bar from the y value at the top.

Suppose the graph above uses position = “identity”. About how many diamonds have an ideal cut and a G color?

  • 5000 ✓
  • 3000 ✗
  • 1800 ✗
  • The graph doesn’t contain enough information to make an estimate. ✗
Correct!

Here the green bar extends all the way from 5000 to 0; most of the bar is behind the blue, purple, and magenta bars. In practice, you would never construct a bar chart like this.

3.2.5 Facets

3.2.5.1 Facetting

You can more easily compare subgroups of data if you place each subgroup in its own subplot, a process known as facetting.

image

facet_grid()

ggplot2 provides two functions for facetting. facet_grid() divides the plot into a grid of subplots based on the values of one or two facetting variables. To use it, add facet_grid() to the end of your plot call.

The code chunks below show three ways to facet with facet_grid(). Spot the differences between the chunks, then run the code to learn what the differences do.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color)) +
  facet_grid(clarity ~ cut)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color)) +
  facet_grid(. ~ cut)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color)) +
  facet_grid(clarity ~ .)

3.2.5.2 facet_grid() recap

As you saw in the code examples, you use facet_grid() by passing it a formula, the names of two variables connected by a ~.

facet_grid() will split the plot into facets vertically by the values of the first variable: each facet will contain the observations that have a common value of the variable. facet_grid() will split the plot horizontally by values of the second variable. The result is a grid of facets, where each specific subplot shows a specific combination of values.

If you do not wish to split on the vertical or horizontal dimension, pass facet_grid() a . instead of a variable name as a placeholder.
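
Recent versions of ggplot2 also let you name the two dimensions explicitly with the rows and cols arguments and the vars() helper, which some readers find easier to remember than the formula. The sketch below assumes a ggplot2 version that supports this interface; p_grid is an illustrative name.

```r
library(ggplot2)

# Same grid as facet_grid(clarity ~ cut): clarity defines the rows,
# cut defines the columns
p_grid <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color)) +
  facet_grid(rows = vars(clarity), cols = vars(cut))

p_grid
```

To split on one dimension only, leave out the other argument instead of using a . placeholder.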

3.2.5.3 facet_wrap()

facet_wrap() provides a more relaxed way to facet a plot on a single variable. It will split the plot into subplots and then reorganize the subplots into multiple rows so that each plot has a more or less square aspect ratio. In short, facet_wrap() wraps the single row of subplots that you would get with facet_grid() into multiple rows.

To use facet_wrap() pass it a single variable name with a ~ before it, e.g. facet_wrap( ~ color).

Add facet_wrap() to the code below to create the graph that appeared at the start of this section. Facet on cut.

# Starting code
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color, fill = cut))

# Solution: facet on cut
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color, fill = cut)) +
  facet_wrap(~ cut)

3.2.5.4 scales

By default, each facet in your plot will share the same x and y ranges. You can change this by adding a scales argument to facet_wrap() or facet_grid().

  • scales = “free” will let the x and y range of each facet vary
  • scales = “free_x” will let the x range of each facet vary, but not the y range
  • scales = “free_y” will let the y range of each facet vary, but not the x range

scales = “free_y” is a convenient way to compare the shapes of different distributions:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color, fill = cut)) +
  facet_wrap( ~ cut, scales = "free_y")

3.2.5.5 Recap

In this tutorial, you learned how to make bar charts; but much of what you learned applies to other types of charts as well. Here’s what you should know:

  • Bar charts are the basis for histograms, which means that you can interpret histograms in a similar way.
  • Bars are not the only geom in ggplot2 that uses the fill aesthetic. You can use both fill and color aesthetics with any geom that has an “interior” region.
  • You can use the same position adjustments with any ggplot2 geom: “identity”, “stack”, “dodge”, “fill”, “nudge”, and “jitter” (we’ll learn about “nudge” and “jitter” later). Each geom comes with its own sensible default.
  • You can facet any ggplot2 plot by adding facet_grid() or facet_wrap() to the plot call.

Bar charts are an excellent way to display the distribution of a categorical variable. In the next tutorial, we’ll meet a set of geoms that display the distribution of a continuous variable.

3.3 Histograms

Learn to make and customize histograms, a device for visualizing the distribution of continuous variables. Here, you will also learn to make similar plots like dotplots, densities, and frequency polygons.

3.3.1 Welcome

Histograms are the most popular way to visualize continuous distributions. Here we will look at them and their derivatives. You will learn how to:

  • Make and interpret histograms
  • Adjust the binwidth of a histogram to reveal new information
  • Use geoms that are similar to histograms, such as dotplots, frequency polygons, and densities

The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

The tutorial uses the ggplot2 and dplyr packages, which have been pre-loaded for your convenience.

3.3.2 Histograms

3.3.2.1 Introduction

Video: https://vimeo.com/221607341

3.3.2.2 How to make a histogram

To make a histogram with ggplot2, add geom_histogram() to the ggplot2 template. For example, the code below plots a histogram of the carat variable in the diamonds dataset, which comes with ggplot2.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.3.2.3 The y variable

As with geom_bar(), you do not need to give geom_histogram() a y variable. geom_histogram() will construct its own y variable by counting the number of observations that fall into each bin on the x axis. geom_histogram() will then map the counts to the y axis.

image

As a result, you can glance at a bar to determine how many observations fall within a bin. Bins with tall bars highlight common values of the x variable.

3.3.2.4 Exercise 1 - Interpretation

image

According to the chart, which is the most common carat size in the data?

  • Approximately 0.3 or 0.4 carats ✓
  • Approximately 1 carat ✗
  • Approximately 1.5 carat ✗
  • Approximately 2 carats ✗
Correct!

More than 15,000 diamonds in the data have a value in the bin between 0.3 and 0.4. That's more than any other bin. How do we know? Because the bar above 0.3 to 0.4 rises above 15,000, higher than any other bar in the plot.

3.3.2.5 binwidth

By default, ggplot2 will choose a binwidth for your histogram that results in about 30 bins. You can set the binwidth manually with the binwidth argument, which is interpreted in the units of the x axis:

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 1)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

bins

Alternatively, you can set the binwidth indirectly with the bins argument, which takes the total number of bins to use:

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), bins = 10)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), bins = 20)

It can be hard to determine what the actual binwidths are when you use bins, since they may not be round numbers.
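
If you need to know the binwidth that a bins value produced, you can ask ggplot2 for the data it computed. The sketch below uses layer_data(), which returns the computed values for one layer of a plot; p_bins, bin_edges, and bin_widths are illustrative names.

```r
library(ggplot2)

p_bins <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), bins = 10)

# layer_data() returns the values ggplot2 computed for the layer,
# including the left (xmin) and right (xmax) edge of every bin
bin_edges <- layer_data(p_bins)
bin_widths <- bin_edges$xmax - bin_edges$xmin

# All bins share one width, so this prints a single number
unique(round(bin_widths, 4))
```

If the width turns out to be an awkward number, switch to binwidth and pick a nearby round value.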

3.3.2.6 boundary

You can move the bins left and right along the x axis with the boundary argument. boundary takes an x value to use as the boundary between two bins (ggplot2 will align the rest of the bins accordingly):

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), bins = 10, boundary = 0)

3.3.2.7 Exercise 2 - binwidth

When you use geom_histogram(), you should always experiment with different binwidths because different size bins reveal different types of information.

To see an example of this, make a histogram of the carat variable in the diamonds dataset. Use a bin size of 0.5 carats. What does the overall shape of the distribution look like?

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

"Good job! The most common diamond size is about 0.5 carats. Larger sizes become progressively less frequent as carat size increases. This accords with general knowledge about diamonds, so you may be prompted to stop exploring the distribution of carat size. But should you?"

3.3.2.8 Exercise 3 - another binwidth

Recreate your histogram of carat but this time use a binwidth of 0.1. Does your plot reveal new information? Look closely. Is there more than one peak? Where do the peaks occur?

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.1)

"Good job! The new binwidth reveals a new phenomenon: carat sizes like 0.5, 0.75, 1, 1.5, and 2 are much more common than carat sizes that do not fall near a common fraction. Why might this be?"

3.3.2.9 Exercise 4 - another binwidth

Recreate your histogram of carat a final time, but this time use a binwidth of 0.01 and set the first boundary to zero. Try to find one new pattern in the results.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01, boundary = 0)

"Good job! The new binwidth reveals another phenomenon: each peak is very right skewed. In other words, diamonds that are 1.01 carats are much more common than diamonds that are 0.99 carats. Why would that be?"

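You can confirm the skew around one carat with a quick table. This is a sketch with dplyr; the window from 0.95 to 1.05 carats is an arbitrary choice, and near_one is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Count the diamonds at each recorded carat value close to one carat
near_one <- diamonds %>%
  filter(carat >= 0.95, carat <= 1.05) %>%
  count(carat)

near_one
```

Compare the rows just below 1 with the rows at and just above 1 to see the same asymmetry that the histogram shows.
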
3.3.2.10 aesthetics

Visually, histograms are very similar to bar charts. As a result, they use the same aesthetics: alpha, color, fill, linetype, and size.

They also behave in the same odd way when you use the color aesthetic. Do you remember what happens?

Which aesthetic would you use to color the interior fill of each bar in a histogram?

  • color ✗
  • fill ✓
Correct!

For geoms with "substance", like bars, fill controls the color of the interior of the geom. Color controls the outline.

3.3.2.11 Exercise 5 - Putting it all together

Recreate the histogram below.

image

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price, fill = cut), position = "stack", binwidth = 1000, boundary = 0)

"Good job! Did you ensure that each binwidth is 1000 and that the first boundary is zero?"

3.3.3 Similar geoms

3.3.3.1 A problem

By adding a fill color to our histogram below, we’ve divided the data into five “sub-distributions”: the distribution of price for Fair cut diamonds, for Good cut diamonds, for Very Good cut diamonds, for Premium cut diamonds, and for Ideal cut diamonds.

image

But this display has some shortcomings:

  • it is difficult to see the “shapes” of the individual distributions
  • it is difficult to compare the individual distributions, because they do not share a common baseline value for y.

3.3.3.2 A solution

We can improve the plot by using a different geom to display the distributions of price values. ggplot2 includes three geoms that display the same information as a histogram, but in different ways:

  1. geom_freqpoly()
  2. geom_density()
  3. geom_dotplot()

3.3.3.3 geom_freqpoly()

geom_freqpoly() plots a frequency polygon, which uses a line to display the same information as a histogram. You can think of a frequency polygon as a line that would connect the top of each bar that appears in a histogram, like this:

image

Note that the bars are not part of the frequency polygon; they are just there for reference. geom_freqpoly() recognizes the same parameters as geom_histogram(), such as bins, binwidth, and boundary.

3.3.3.4 Exercise 6 - Frequency polygons

Create the frequency polygon depicted above. It has a binwidth of 0.25 and starts at the boundary zero.

ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = carat), binwidth = 0.25, boundary = 0)

"Good job! By using a line instead of bars, frequency polygons can sometimes do things that histograms cannot."

3.3.3.5 Exercise 7 - Multiple frequency polygons

Use a frequency polygon to recreate our chart of price and cut. Since lines do not have “substance” like bars, you will want to use the color aesthetic instead of the fill aesthetic.

image

ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = price, color = cut), binwidth = 1000, boundary = 0)

"Good job! Since lines do not occlude each other, `geom_freqpoly()` plots each sub-group against the same baseline: y = 0 (i.e. it unstacks the sub-groups). This makes it easier to compare the distributions. You can now see that for almost every price value, there are more Ideal cut diamonds than there are other types of diamonds."

3.3.3.6 geom_density()

Our frequency polygon highlights a second shortcoming with our graph: it is difficult to compare the shapes of the distributions because some sub-groups contain more diamonds than others. This compresses smaller subgroups into the bottom of the graph.

image

You can avoid this with geom_density().

3.3.3.7 Density curves

geom_density() plots a kernel density estimate (i.e. a density curve) for each distribution. This is a smoothed representation of the data, analogous to a smoothed histogram.

Density curves do not plot count on the y axis but density, which is analogous to the count divided by the total number of observations. Densities make it easy to compare the distributions of sub-groups. When you plot multiple sub-groups, each density curve will contain the same area under its curve.
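
You can check the claim about equal areas numerically. The sketch below uses density() from base R as a stand-in for the estimate that geom_density() draws, which is an assumption about how closely the two agree; fair_prices, d, and area are illustrative names.

```r
library(ggplot2)  # provides the diamonds data set

# Kernel density estimate of price for Fair cut diamonds
fair_prices <- diamonds$price[diamonds$cut == "Fair"]
d <- density(fair_prices)

# Trapezoid rule: approximate the area under the estimated curve
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

The area comes out very close to one. Repeating the calculation for another cut gives the same answer, even though the cuts contain very different numbers of diamonds.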

image

image

Exercise 8 - Density curves

Re-draw our plot with density curves. How do you interpret the results?

image

ggplot(data = diamonds) +
  geom_density(mapping = aes(x = price, color = cut))

"Good job! You can now compare the most common prices for each sub-group. For example, the plot shows that the most common price for most diamonds is near $1000. However, the most common price for Fair cut diamonds is significantly higher, about $2500. We will come back to this oddity in a later tutorial."

3.3.3.8 Density parameters

Density plots do not take bins, binwidth, and boundary parameters. Instead they recognize kernel and smoothing parameters that are used in the density fitting algorithm, which is fairly sophisticated.

In practice, you can obtain useful results quickly with the default parameters of geom_density(). If you’d like to learn more about density estimates and their tuning parameters, begin with the help page at ?geom_density().
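
If you do want to experiment, two arguments of geom_density() are a reasonable place to start: adjust, which scales the default bandwidth, and kernel, which selects the smoothing kernel. The values below are arbitrary choices for illustration, and p_density is an illustrative name.

```r
library(ggplot2)

p_density <- ggplot(data = diamonds) +
  geom_density(
    mapping = aes(x = price, color = cut),
    adjust = 2,              # multiplies the default bandwidth; larger values smooth more
    kernel = "epanechnikov"  # one of the kernels accepted by stats::density()
  )

p_density
```

Try adjust values below and above 1 to see how the amount of smoothing changes the story the curves tell.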

3.3.3.9 geom_dotplot()

ggplot2 provides a final geom for displaying one dimensional distributions: geom_dotplot(). geom_dotplot() represents each observation with a dot and then stacks dots within bins to create the semblance of a histogram.

Dotplots can provide an intuitive display of the data, but they have several shortcomings. Dotplots are not ideal for large data sets like diamonds, and provide meaningless y axis labels. I also find that the tuning parameters of geom_dotplot() make dotplots too slow to work with for EDA.

ggplot(data = mpg) +
  geom_dotplot(mapping = aes(x = displ), dotsize = 0.5, stackdir = "up", stackratio = 1.1)
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

Exercise 9 - Facets

Instead of changing geoms, you can make the sub-groups in our original plot easier to compare by facetting the data. Extend the code below to facet by cut.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price, fill = cut), binwidth = 1000, boundary = 0) + 
  facet_wrap(~cut)

"Good job! Facets make it easier to compare sub-groups, but at the expense of separating the data. You may decide that frequency polygons and densities allow more direct comparisons."

3.3.3.10 Recap

In this tutorial, you learned how to visualize distributions with histograms, frequency polygons, and densities. But what should you look for in these visualizations?

  • Look for places with lots of data. Tall bars reveal the most common values in your data; you can expect these values to be the “typical values” for your variable.

  • Look for places with little data. Short bars reveal uncommon values. These values appear rarely and you might be able to figure out why.

  • Look for outliers. Bars that appear away from the bulk of the data are outliers, special cases that may reveal unexpected insights.

Sometimes outliers cannot be seen in a plot, but can be inferred from the range of the x axis. For example, many of the plots in this tutorial seemed to extend well past the end of the data. Why? Because the range was stretched to include outliers. When your data set is large, like diamonds, the bar that describes an outlier may be invisible (i.e. less than one pixel high).

  • Look for clusters.

  • Look for shape. The shape of a histogram can often indicate whether or not a variable behaves according to a known probability distribution.
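
When you suspect that outliers are hiding in bars too short to see, you can zoom in on the bottom of the y axis. The sketch below uses coord_cartesian(), which changes the visible range without discarding data; the limit of 50 and the name p_zoom are arbitrary choices for illustration.

```r
library(ggplot2)

# Zoom in on small counts so that bars for rare values become visible
p_zoom <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.1) +
  coord_cartesian(ylim = c(0, 50))  # 50 is an arbitrary cut-off for illustration

p_zoom
```

Tall bars are clipped at the top of the plot, but short bars far out on the x axis now stand out.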

The most important thing to remember about histograms, frequency polygons, and dotplots is to explore different binwidths. The binwidth of a histogram determines what information the histogram displays. You cannot predict ahead of time which binwidth will reveal unexpected information.

3.4 Boxplots and Counts

Here you will learn to make and customize boxplots, a chart type that makes it easy to visualize the relationship between continuous and categorical variables. You will also learn to visualize the relationship between two categorical variables with a counts plot.

3.4.1 Welcome

Boxplots display the relationship between a continuous variable and a categorical variable. Count plots display the relationship between two categorical variables. In this tutorial, you will learn how to use both. You will learn how to:

  • Make and interpret boxplots
  • Rotate boxplots by flipping the coordinate system of your plot
  • Use violin plots and dotplots, two geoms that are similar to boxplots
  • Make and interpret count plots

The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

The tutorial uses the ggplot2 and dplyr packages, which have been pre-loaded for your convenience.

3.4.2 Boxplots

3.4.2.1 Introduction

Video: https://vimeo.com/222358034

3.4.2.2 Exercise 1 - Boxplots

image

Which of the sub-plots accurately describes the data above with a boxplot?

  • A ✗
  • B ✗
  • C ✓
Correct!

3.4.2.3 How to make a boxplot

To make a boxplot with ggplot2, add geom_boxplot() to the ggplot2 template. For example, the code below uses boxplots to display the relationship between the class and hwy variables in the mpg dataset, which comes with ggplot2.

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

#### Categorical and continuous

geom_boxplot() expects the y axis to be continuous, but accepts categorical variables on the x axis. For example, here class is categorical. geom_boxplot() will automatically plot a separate boxplot for each value of x. This makes it easy to compare the distributions of points with different values of x.

image

#### Exercise 2 - Interpretation

image

Which class of car has the lowest median highway fuel efficiency (hwy value)?

  • 2seater ✗
  • compact ✗
  • midsize ✗
  • minivan ✗
  • pickup ✓
  • subcompact ✗
  • suv ✗
Correct!
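If you would like to double check a reading like this one, you can compute the medians that the boxplot draws. The sketch below assumes the dplyr package is loaded; the names `class_medians` and `median_hwy` are made up for this example.

```r
library(ggplot2)
library(dplyr)

# Median highway mileage for each class, sorted from lowest to highest
class_medians <- mpg %>%
  group_by(class) %>%
  summarise(median_hwy = median(hwy)) %>%
  arrange(median_hwy)

class_medians
```

The first row of the result should match the class whose box has the lowest middle line in the plot.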

3.4.2.4 Exercise 3 - Make a Boxplot

Recreate the boxplot below with the diamonds data set.

image

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))

"Do you notice how many outliers appear in the plot? The boxplot algorithm can identify many outliers if your data is big, perhaps too many. Let's look at ways to suppress the appearance of outliers in your plot."

3.4.2.5 Outliers

You can change how outliers look in your boxplot with the parameters outlier.color, outlier.fill, outlier.shape, outlier.size, outlier.stroke, and outlier.alpha (outlier.shape takes a number from 0 to 25).

Unfortunately, you can’t tell geom_boxplot() to ignore outliers completely, but you can make outliers disappear by setting outlier.alpha = 0. Try it in the plot below.

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price), outlier.shape  = 24, 
               outlier.fill = "white", outlier.stroke = 0.25)

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price), outlier.shape  = 24, 
               outlier.fill = "white", outlier.stroke = 0.25, outlier.alpha = 0)

3.4.2.6 Aesthetics

Boxplots recognize the following aesthetics: alpha, color, fill, group, linetype, shape, size, and weight.

Of these, group can be the most useful. Consider the plot below. It uses a continuous variable on the x axis. As a result, geom_boxplot() is not sure how to split the data into categories: it lumps all of the data into a single boxplot. The result reveals little about the relationship between carat and price.

image

In the next sections, we’ll use group to make a more informative plot.

3.4.2.7 How to “cut” a continuous variable

ggplot2 provides three helper functions that you can use to split a continuous variable into categories. Each takes a continuous vector and returns a categorical vector that assigns each value to a group. For example, cut_interval() bins a vector into n equal length bins.

continuous_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
continuous_vector
##  [1]  1  2  3  4  5  6  7  8  9 10
cut_interval(continuous_vector, n = 3)
##  [1] [1,4]  [1,4]  [1,4]  [1,4]  (4,7]  (4,7]  (4,7]  (7,10] (7,10] (7,10]
## Levels: [1,4] (4,7] (7,10]

3.4.2.8 The cut functions

The three cut functions are

  • cut_interval() which makes n groups with equal range
  • cut_number() which makes n groups with (approximately) equal numbers of observations
  • cut_width() which makes groups with width width
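The difference between equal range and equal numbers of observations is easiest to see with skewed data. The sketch below uses a made-up vector, `skewed_vector`, that contains one very large value; the vector and the object names are assumptions for this example only.

```r
library(ggplot2)

# A hypothetical vector with one value far from the rest
skewed_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Two bins that each cover half of the range from 1 to 100
by_interval <- cut_interval(skewed_vector, n = 2)

# Two bins that each hold about half of the observations
by_number <- cut_number(skewed_vector, n = 2)

# Count how many values land in each bin
table(by_interval)
table(by_number)
```

With data like this, cut_interval() places almost every value in the first bin, while cut_number() moves the boundary so that the two bins hold the same number of values.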

Use one of the three functions above to bin continuous_vector into groups of width = 2.

cut_width(continuous_vector, width = 2)
##  [1] [1,3]  [1,3]  [1,3]  (3,5]  (3,5]  (5,7]  (5,7]  (7,9]  (7,9]  (9,11]
## Levels: [1,3] (3,5] (5,7] (7,9] (9,11]
"Good job! Now let's apply the cut functions to our graph."

3.4.2.9 Exercise 4 - Apply a cut function

When you set the group aesthetic of a boxplot, geom_boxplot() will draw a separate boxplot for each collection of observations that have the same value of whichever vector you map to group.

This means we can split our carat plot by mapping group to the output of a cut function, as in the code below. Study the code, then modify it to create a separate boxplot for each 0.25 wide interval of carat.

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = carat, y = price, group = cut_interval(carat, n = 2)))

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = carat, y = price, group = cut_width(carat, width = 0.25)))

"Good job! You can now see a relationship between price and carat. You could also make a scatterplot of these variables, but in this case, it would be a black mass of 54,000 data points."

3.4.2.10 coord_flip()

geom_boxplot() always expects the categorical variable to appear on the x axis, which creates vertical boxplots. But what if you’d like to make horizontal boxplots, like in the plot below?

image

3.4.2.11 Exercise 5 - Horizontal boxplots

Extend the code below to orient the boxplots horizontally.

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy)) + 
  coord_flip()

"Good job! `coord_flip()` is an example of a new coordinate system. You'll learn much more about ggplot2 coordinate systems in a later tutorial."

3.4.3 Similar Geoms

3.4.3.1 geom_dotplot()

Boxplots provide a quick way to represent a distribution, but they leave behind a lot of information. ggplot2 supplements boxplots with two geoms that show more information.

The first is geom_dotplot(). If you set the binaxis parameter of geom_dotplot() to "y", geom_dotplot() behaves like geom_boxplot(), displaying a separate distribution for each group of data.

Here each group functions like a vertical histogram. Add the parameter stackdir = "center" then re-run the code. Can you interpret the results?

ggplot(data = mpg) +
  geom_dotplot(mapping = aes(x = class, y = hwy), binaxis = "y", 
               dotsize = 0.5, binwidth = 1)

ggplot(data = mpg) +
  geom_dotplot(mapping = aes(x = class, y = hwy), binaxis = "y", 
               dotsize = 0.5, binwidth = 1, stackdir = "center")

'Good job! When you set `stackdir = "center"`, `geom_dotplot()` arranges each row of dots symmetrically around the $x$ value. This layout will help you understand the next geom. As in the histogram tutorial, it takes a lot of tweaking to make a dotplot look right. As a result, I tend to only use them when I want to make a point.'

3.4.3.2 geom_violin()

geom_violin() provides a second alternative to geom_boxplot(). A violin plot uses densities to draw a smoothed version of the centered dotplot you just made.

You can think of a violin plot as an outline drawn around the edges of a centered dotplot. Each “violin” spans the range of the data. The violin is thick where there are many values, and thin where there are few.

Convert the plot below from a boxplot to a violin plot. Note that violin plots do not use the parameters you saw for dotplots.

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
  geom_violin(mapping = aes(x = class, y = hwy))

'Good job! Another way to interpret a violin plot is to mentally "push" the width of each violin all to one side (so the other side is a straight line). The result would be a density (e.g. `geom_density()`) turned on its side for each distribution.'

3.4.3.3 Exercise 7 - Violin plots

You can further enhance violin plots by adding the parameter draw_quantiles = c(0.25, 0.5, 0.75). This will cause ggplot2 to draw horizontal lines across the violins at the 25th, 50th, and 75th percentiles. These are the same three horizontal lines that are displayed in a boxplot (the 25th and 75th percentiles are the bounds of the box, the 50th percentile is the median).

Add these lines to the violin plot below.

ggplot(data = mpg) +
  geom_violin(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
  geom_violin(mapping = aes(x = class, y = hwy), draw_quantiles = c(0.25, 0.5, 0.75))

3.4.4 Counts

3.4.4.1 geom_count()

Boxplots provide an efficient way to explore the interaction of a continuous variable and a categorical variable. But what if you have two categorical variables?

You can see how observations are distributed across two categorical variables with geom_count(). geom_count() draws a point at each combination of values from the two variables. The size of the point is mapped to the number of observations with this combination of values. Rare combinations will have small points, frequent combinations will have large points.

image

ggplot(data = diamonds) + 
  geom_count(mapping = aes(x = cut, y = clarity))

3.4.4.2 count

You can use the count() function in the dplyr package to compute the count values displayed by geom_count(). To use count(), pass it a data frame and then the names of zero or more variables in the data frame. count() will return a new table that lists how many observations occur with each possible combination of the listed variables.

So for example, the code below returns the counts that you visualized in Exercise 8.

diamonds %>% 
   count(cut, clarity)
## # A tibble: 40 × 3
##    cut   clarity     n
##    <ord> <ord>   <int>
##  1 Fair  I1        210
##  2 Fair  SI2       466
##  3 Fair  SI1       408
##  4 Fair  VS2       261
##  5 Fair  VS1       170
##  6 Fair  VVS2       69
##  7 Fair  VVS1       17
##  8 Fair  IF          9
##  9 Good  I1         96
## 10 Good  SI2      1081
## # … with 30 more rows
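If it helps to see what count() is doing, the sketch below rebuilds the same table by hand with group_by() and summarise(). The object names `counts` and `manual_counts` are made up for this example.

```r
library(ggplot2)
library(dplyr)

# The shortcut: one row per combination of cut and clarity
counts <- diamonds %>%
  count(cut, clarity)

# The long way: group the rows, then count the rows in each group
manual_counts <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(n = n(), .groups = "drop")

# The n columns should agree, and they should add up to the number of diamonds
identical(counts$n, manual_counts$n)
sum(counts$n) == nrow(diamonds)
```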

3.4.4.3 Heat maps

Heat maps provide a second way to visualize the relationship between two categorical variables. They work like count plots, but use a fill color instead of a point size to display the number of observations in each combination.

3.4.4.4 How to make a heat map

ggplot2 does not provide a geom function for heat maps, but you can construct a heat map by plotting the results of count() with geom_tile().

To do this, set the x and y aesthetics of geom_tile() to the variables that you pass to count(). Then map the fill aesthetic to the n variable computed by count(). The plot below displays the same counts as the plot in Exercise 8.

diamonds %>% 
   count(cut, clarity) %>% 
   ggplot() +
     geom_tile(mapping = aes(x = cut, y = clarity, fill = n))

3.4.4.5 Exercise 9 - Make a heat map

Practice the method above by re-creating the heat map below.

diamonds %>% 
   count(color, cut) %>% 
   ggplot(mapping = aes(x = color, y = cut)) +
     geom_tile(mapping = aes(fill = n))

#### Recap

Boxplots, dotplots and violin plots provide an easy way to look for relationships between a continuous variable and a categorical variable. Violin plots convey a lot of information quickly, but boxplots have a head start in popularity — they were easy to use when statisticians had to draw graphs by hand.

In any of these graphs, look for distributions, ranges, medians, skewness, or anything else that changes in an unusual way from one distribution to the next. Often, you can make patterns even more revealing with the fct_reorder() function from the forcats package (we’ll wait to learn about forcats until after you study factors).

Count plots and heat maps help you see how observations are distributed across the interactions of two categorical variables.

3.5 Scatterplots

This tutorial revisits scatterplots, which display the relationship between two continuous variables. Along the way, you will learn to build multi-layer plots and to use new coordinate systems.

3.5.1 Welcome

A scatterplot displays the relationship between two continuous variables. Scatterplots are one of the most common types of graphs—in fact, you’ve met scatterplots already in Visualization Basics.

In this tutorial, you’ll learn how to:

  • Make new types of scatterplots with geom_text() and geom_jitter()
  • Add multiple layers of geoms to a plot
  • Enhance scatterplots with geom_smooth(), geom_rug(), and geom_label_repel()
  • Change the coordinate system of a plot

The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

The tutorial uses the ggplot2, ggrepel, and dplyr packages, which have been pre-loaded for your convenience.

3.5.2 Scatterplots

3.5.2.1 Review 1 - geom_point()

In Visualization Basics, you learned how to make a scatterplot with geom_point().

The code below summarises the mpg data set and begins to plot the results. Finish the plot with geom_point(). Put mean_cty on the x axis and mean_hwy on the y axis.

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot()

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() + 
    geom_point(mapping = aes(x = mean_cty, y = mean_hwy))

"Good job! It can be tricky to remember when to use %>% and when to use +. Use %>% to add one complete step to a pipe of code. Use + to add one more line to a ggplot2 call."

3.5.2.2 geom_text()

geom_text() and geom_label() create scatterplots that use words instead of points to display data. Each requires the extra aesthetic label, which you should map to a variable that contains text to display for each observation.

Convert the plot below from geom_point() to geom_text() and map the label aesthetic to the class variable. When you are finished convert the code to geom_label() and rerun the plot. Can you spot the difference?

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_point(mapping = aes(x = mean_cty, y = mean_hwy))

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_text(mapping = aes(x = mean_cty, y = mean_hwy, label = class))

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_label(mapping = aes(x = mean_cty, y = mean_hwy, label = class))

"Good job! geom_text() replaces each point with a piece of text supplied by the label aesthetic. geom_label replaces each point with a textbox. Notice that some pieces of text overlap each other, and others run off the page. We'll soon look at a way to fix this."

3.5.2.3 geom_smooth()

In Visualization Basics, you met geom_smooth(), which provides a summarised version of a scatterplot.

geom_smooth() uses a model to fit a smoothed line to the data and then visualizes the results. By default, geom_smooth() fits a loess smooth to data sets with fewer than 1,000 observations, and a generalized additive model to data sets with 1,000 or more observations.

image image

3.5.2.4 method

You can use the method parameter of geom_smooth() to fit and display other types of model lines. To do this, pass method the name of an R modeling function for geom_smooth() to use, such as lm (for linear models) or glm (for generalized linear models).
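To get a feel for what a modeling function returns, you can call one directly. The sketch below fits a linear model with lm() to the class averages used in this tutorial. It assumes that the default smoothing formula, y ~ x, corresponds to mean_hwy ~ mean_cty here; the names `class_means` and `fit` are made up for the example.

```r
library(ggplot2)
library(dplyr)

# Average city and highway mileage for each class of car
class_means <- mpg %>%
  group_by(class) %>%
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy))

# Fit a straight line that predicts mean_hwy from mean_cty
fit <- lm(mean_hwy ~ mean_cty, data = class_means)

# The intercept and slope of the fitted line
coef(fit)
```

A linear smooth of these data should trace a line with this intercept and slope.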

In the code below, use geom_smooth() to draw the linear model line that fits the data.

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() 

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm)
## `geom_smooth()` using formula = 'y ~ x'

"Good job! Now let's look at a way to make geom_smooth() much more useful."

3.5.3 Layers

3.5.3.1 Add a layer

geom_smooth() becomes much more useful when you combine it with geom_point() to create a scatterplot that contains both:

  • raw data
  • a trend line

In ggplot2, you can add multiple geoms to a plot by adding multiple geom functions to the plot call. For example, the code below creates a plot that contains both points and a smooth line. Imagine what the results will look like in your head, and then run the code to see if you are right.

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
    geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) 
## `geom_smooth()` using formula = 'y ~ x'

"Good job! You can add as many geom functions as you like to a plot; but, in practice, a plot will become hard to interpret if it contains more than two or three geoms."

3.5.3.2 geom_label_repel()

Do you remember how the labels that we made earlier overlapped each other and ran off our graph? The geom_label_repel() geom from the ggrepel package mitigates these problems by using an algorithm to arrange labels within a plot. It works best in conjunction with a layer of points that displays the true location of each observation.

Use geom_label_repel() to add a new layer to our plot below. geom_label_repel() requires the same aesthetics as geom_label(): x, y, and label (here set to class).

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
    geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm)
## `geom_smooth()` using formula = 'y ~ x'

library(ggrepel)
mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
    geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) +
    geom_label_repel(mapping = aes(x = mean_cty, y = mean_hwy, label = class))
## `geom_smooth()` using formula = 'y ~ x'

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
    geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) +
    geom_text_repel(mapping = aes(x = mean_cty, y = mean_hwy, label = class))
## `geom_smooth()` using formula = 'y ~ x'

"Good job! The ggrepel package also provides geom_text_repel(), which is an analog for geom_text()."

3.5.3.3 Code duplication

If you study the solution for the previous exercise, you’ll notice a fair amount of duplication. We set the same aesthetic mappings in three different places.

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
    geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) +
    geom_label_repel(mapping = aes(x = mean_cty, y = mean_hwy, label = class))
## `geom_smooth()` using formula = 'y ~ x'

You should try to avoid duplication whenever you can in code because duplicated code invites typos, is hard to update, and takes longer than needed to write. Thankfully, ggplot2 provides a way to avoid duplication across multiple layers.

3.5.3.4 ggplot() mappings

You can set aesthetic mappings in two places within any ggplot2 call. You can set the mappings inside of a geom function, as we’ve been doing. Or you can set the mappings inside of the ggplot() function like below:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

3.5.3.5 Global vs. Local mappings

ggplot2 will treat any mappings set in the ggplot() function as global mappings. Each layer in the plot will inherit and use these mappings.

ggplot2 will treat any mappings set in a geom function as local mappings. Only the local layer will use these mappings. The mappings will override the global mappings if the two conflict, or add to them if they do not.

This system creates an efficient way to write plot calls:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(mapping = aes(color = class), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 5.6935
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.5065
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.65044
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 4.008
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.708
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.6135e-17
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.25
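The plot above shows a local mapping that adds to the global mappings. The sketch below shows the other case, a local mapping that overrides a global one. It is only an illustration: the second layer and the names `inherited` and `overridden` are made up for this example, and layer_data() is used simply to peek at the values that each layer plots.

```r
library(ggplot2)

# Global mappings: x = displ, y = hwy
# The first layer inherits both; the second layer overrides y with cty
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_point(mapping = aes(y = cty), color = "red")

p

# Inspect the data that each layer actually plots
inherited  <- layer_data(p, 1)  # y values come from hwy
overridden <- layer_data(p, 2)  # y values come from cty
```

Both layers share the global x mapping, so the red points sit at the same displ values as the black ones.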

3.5.3.6 Exercise 2

Reduce duplication in the code below by moving as many local mappings into the global mappings as possible. Rerun the new code to ensure that it creates the same plot.

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot() +
    geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
    geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) +
    geom_label_repel(mapping = aes(x = mean_cty, y = mean_hwy, label = class))
## `geom_smooth()` using formula = 'y ~ x'

mpg %>% 
  group_by(class) %>% 
  summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>% 
  ggplot(mapping = aes(x = mean_cty, y = mean_hwy)) +
    geom_point() +
    geom_smooth(method = lm) +
    geom_label_repel(mapping = aes(label = class))
## `geom_smooth()` using formula = 'y ~ x'

#### Exercise 3 - Global vs. Local

Recreate the plot below in the most efficient way possible.

image

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

"Good Job!"

3.5.3.7 Global vs. Local data

The data argument also follows a global vs. local system. If you set the data argument of a geom function, the geom will use the data you supply instead of the data contained in ggplot(). This is a convenient way to highlight groups of points.

Use data arguments to recreate the plot below. I’ve started the code for you.

image

mpg2 <- filter(mpg, class == "2seater")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_point(data = mpg2, color = "red", size = 2)

"Good Job!"

3.5.3.8 Exercise 4 - Global vs. Local data

Use data arguments to recreate the plot below.

image

mpg3 <- filter(mpg, hwy > 40)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_label_repel(data = mpg3, mapping = aes(label = class))

3.5.3.9 last_plot()

When exploring data, you’ll often make a plot and then think of a way to improve it. Instead of starting from scratch or copying and pasting your code, you can use ggplot2’s last_plot() function. last_plot() returns the most recent plot call, which makes it easy to build up a plot one layer at a time.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

last_plot() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

last_plot() +
  geom_smooth(method = lm, color = "purple")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#### Saving plots

If you’d like to work with a plot later, you can save it to an R object. Later you can display the plot or add to it, as if you were using last_plot().

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

Notice that ggplot2 will not display a plot when you save it. It waits until you call the saved object.

p
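Adding a layer to a saved plot does not change the saved object; it returns a new plot that you can display or save under another name. A minimal sketch, where the name `p_smooth` is made up for this example:

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

# Add a layer to the saved plot and save the result under a new name
p_smooth <- p +
  geom_smooth(mapping = aes(x = cty, y = hwy), method = lm)

# p still contains one layer; p_smooth contains two
length(p$layers)
length(p_smooth$layers)

p_smooth
```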

3.5.3.10 geom_rug()

geom_rug() adds another type of summary to a plot. It displays the one-dimensional marginal distributions of each variable in the scatterplot. These appear as collections of tick marks along the x and y axes.

In the chunk below, use the faithful dataset to create a scatterplot that has the waiting variable on the x axis and the eruptions variable on the y axis. Use geom_rug() to add a rug plot to the scatterplot. Like geom_point(), geom_rug() requires x and y aesthetic mappings.

ggplot(data = faithful, mapping = aes(x = waiting, y = eruptions)) + 
  geom_point() + 
  geom_rug()

3.5.3.11 geom_jitter()

geom_jitter() plots a scatterplot and then adds a small amount of random noise to each point in the plot. It is a shortcut for adding a "jitter" position adjustment to a points plot (i.e., geom_point(position = "jitter")).

Why would you use geom_jitter()? Jittering provides a simple way to inspect patterns that occur in heavily gridded or overlapping data. To see what I mean, replace geom_point() with geom_jitter() in the plot below.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class, y = hwy))

"Good job! You can also jitter in only a single direction. To turn off jittering in the x direction set width = 0 in geom_jitter(). To turn off jittering in the y direction, set height = 0."

3.5.3.12 jitter and boxplots

geom_jitter() provides a convenient way to overlay raw data on boxplots, which display summary information.

Use the chunk below to create a boxplot of the previous graph. Arrange for the outliers to have an alpha of 0, which will make them completely transparent. Then add a layer of points that are jittered in the y direction, but not the x direction.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot(outlier.alpha = 0) + 
  geom_jitter(width = 0)
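If you want to confirm which direction was jittered, you can inspect the positions that the jitter layer plots. The sketch below is an informal check that uses layer_data(); the name `jittered` is made up for this example.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot(outlier.alpha = 0) +
  geom_jitter(width = 0)

# Positions used by the second layer (the jittered points)
jittered <- layer_data(p, 2)

# x positions are still whole numbers, one per class
all(as.numeric(jittered$x) == round(as.numeric(jittered$x)))

# y positions have been nudged away from the whole-number hwy values
any(jittered$y != round(jittered$y))
```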

3.5.4 Coordinate Systems

3.5.4.1 coord_flip()

One way to customize a scatterplot is to plot it in a new coordinate system. ggplot2 provides several helper functions that change the coordinate system of a plot. You’ve already seen one of these in action in the boxplots tutorial: coord_flip() flips the x and y axes of a plot.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot(outlier.alpha = 0) +
  geom_jitter(width = 0) +
  coord_flip()

#### The coord functions

Altogether, ggplot2 comes with seven coord functions:

  • coord_cartesian() - (the default) Cartesian coordinates
  • coord_fixed() - Cartesian coordinates that maintain a fixed aspect ratio as the plot window is resized
  • coord_flip() - Cartesian coordinates with x and y axes flipped
  • coord_map() and coord_quickmap() - cartographic projections for plotting maps
  • coord_polar() - polar coordinates
  • coord_trans() - transformed Cartesian coordinates

By default, ggplot2 will draw a plot in Cartesian coordinates unless you add one of the functions above to the plot code.

3.5.4.2 coord_polar()

You use each coord function like you use coord_flip(), by adding it to a ggplot2 call.

So for example, you could add coord_polar() to a plot to make a graph that uses polar coordinates.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) 

last_plot() +
  coord_polar()

3.5.4.3 Coordinate systems and scatterplots

How can a coordinate system improve a scatterplot?

Consider the scatterplot below. It shows a strong relationship between the carat size of a diamond and its price.

image

However, the relationship does not appear linear. It appears to have the form $y = x^n$, a common relationship found in nature. You can estimate n by replotting the data in a log-log plot.

3.5.4.4 log-log plots

Log-log plots graph the log of x vs. the log of y, which has a valuable visual effect. If you log both sides of a relationship like

\[y = x^n\]

You get a linear relationship with slope n:

\[\log(y) = \log(x^n)\] \[\log(y)= n\cdot \log(x)\]

In other words, log-log plots unbend power relationships into straight lines. Moreover, they display n as the slope of the straight line, which is reasonably easy to estimate.

Try this by using the diamonds dataset to plot log(price) against log(carat).

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = log(carat), y = log(price)))
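You can also estimate the slope numerically instead of by eye. The sketch below fits a straight line to the logged data with lm(), treating price as y and carat as x. Note one assumption: a fitted line also has an intercept, which corresponds to a constant multiplier in front of the power relationship.

```r
library(ggplot2)

# Fit a straight line to the log-log data
fit <- lm(log(price) ~ log(carat), data = diamonds)

# The coefficient on log(carat) is the estimate of n
coef(fit)
```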

#### coord_trans()

coord_trans() provides a second way to do the same transformation, or similar transformations.

To use coord_trans() give it an x and/or a y argument. Set each to the name of an R function surrounded by quotation marks. coord_trans() will use the function to transform the specified axis before plotting the raw data.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  coord_trans(x = "log", y = "log")
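The two approaches differ in where the transformation happens. As a rough check, the sketch below compares the x values that each plot stores, using layer_data(); the names `p_logged` and `p_coord` are made up for this example.

```r
library(ggplot2)

# Transform the data inside aes(): the plot stores logged values
p_logged <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = log(carat), y = log(price)))

# Transform the coordinate system: the plot stores the raw values
p_coord <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  coord_trans(x = "log", y = "log")

range(layer_data(p_logged)$x)  # range of log(carat)
range(layer_data(p_coord)$x)   # range of carat itself
```

Because the second plot keeps the raw values, its axes are labelled in carats and dollars rather than in logged units.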

3.5.4.5 Recap

Scatterplots are one of the most useful types of plots for data science. You will have many chances to use geom_point(), geom_smooth(), and geom_label_repel() in your day to day work.

However, this tutorial introduced two important concepts that apply to more than just scatterplots:

  • You can add multiple layers to any plot that you make with ggplot2
  • You can add a different coordinate system to any plot that you make with ggplot2

3.6 Line Plots and Maps

Learn to connect data points to make line plots, polygon plots, and even maps.

3.6.1 Welcome

A line graph displays a functional relationship between two continuous variables. A map displays spatial data. The two may seem different, but they are made in similar ways. This tutorial will examine them both.

In this tutorial, you’ll learn how to:

  • Make new types of line plots with geom_step(), geom_area(), geom_path(), and geom_polygon()
  • Avoid “whipsawing” with the group aesthetic
  • Find and plot map data with geom_map()
  • Transform a coordinate system into a map projection with coord_map()

The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

The tutorial uses the ggplot2, maps, mapproj, and dplyr packages, which have been pre-loaded for your convenience.

3.6.2 Line graphs

3.6.2.1 Line Graph vs. Scatterplot

Like scatterplots, line graphs display the relationship between two continuous variables. However, unlike scatterplots, line graphs expect the variables to have a functional relationship, where each value of x is associated with only one value of y.

For example, in the plot below, there is only one value of unemploy for each value of date.
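You can check for this property before you choose a geom. The sketch below is an informal test that uses anyDuplicated(), which returns 0 when no value in a vector repeats; the comparison with mpg is added only for contrast.

```r
library(ggplot2)

# Each date appears only once in economics, so there is one unemploy value per date
anyDuplicated(economics$date)

# displ values repeat in mpg, and the repeats come with different hwy values,
# so these variables do not have a functional relationship
anyDuplicated(mpg$displ)
```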

image

#### geom_line()

Use the geom_line() function to make line graphs. Like geom_point(), it requires x and y aesthetics.

Use geom_line() in the chunk below to recreate the graph above. The graph uses the economics dataset that comes with ggplot2 and maps the date and unemploy variables to the x and y axes. See Visualization Basics if you are completely stuck.

ggplot(data = economics) + 
  geom_line(mapping = aes(x = date, y = unemploy))

"Good Job! The graph shows the number of unemployed people in the US (in thousands) from 1967 to 2015. Now let's look at a richer dataset."

3.6.2.2 asia

I’ve used the gapminder package to assemble a new data set named asia to plot. Among other things, asia contains the per capita GDP of four countries from 1952 to 2007.

The following code uses the gapminder package: https://CRAN.R-project.org/package=gapminder

library(gapminder)
gapminder
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows
unique(filter(gapminder, continent == "Asia")$country)
##  [1] Afghanistan        Bahrain            Bangladesh         Cambodia          
##  [5] China              Hong Kong, China   India              Indonesia         
##  [9] Iran               Iraq               Israel             Japan             
## [13] Jordan             Korea, Dem. Rep.   Korea, Rep.        Kuwait            
## [17] Lebanon            Malaysia           Mongolia           Myanmar           
## [21] Nepal              Oman               Pakistan           Philippines       
## [25] Saudi Arabia       Singapore          Sri Lanka          Syria             
## [29] Taiwan             Thailand           Vietnam            West Bank and Gaza
## [33] Yemen, Rep.       
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
Asia <- filter(gapminder, country %in% c("China", "Japan", "Korea, Dem. Rep.", "Korea, Rep."))
asia <- Asia %>% 
  mutate(country = case_when(country == "Korea, Dem. Rep." ~ "North Korea",
                             country == "Korea, Rep." ~ "South Korea",
                             TRUE ~ as.character(country)))
asia
## # A tibble: 48 × 6
##    country continent  year lifeExp        pop gdpPercap
##    <chr>   <fct>     <int>   <dbl>      <int>     <dbl>
##  1 China   Asia       1952    44    556263527      400.
##  2 China   Asia       1957    50.5  637408000      576.
##  3 China   Asia       1962    44.5  665770000      488.
##  4 China   Asia       1967    58.4  754550000      613.
##  5 China   Asia       1972    63.1  862030000      677.
##  6 China   Asia       1977    64.0  943455000      741.
##  7 China   Asia       1982    65.5 1000281000      962.
##  8 China   Asia       1987    67.3 1084035000     1379.
##  9 China   Asia       1992    68.7 1164970000     1656.
## 10 China   Asia       1997    70.4 1230075000     2289.
## # … with 38 more rows

3.6.2.3 whipsawing

However, when we plot the asia data we get an odd-looking graph. The line seems to “whipsaw” up and down. Whipsawing is one of the most commonly encountered challenges with line graphs.

ggplot(asia) +
  geom_line(mapping = aes(x = year, y = gdpPercap))

3.6.2.4 Review 1 - Whipsawing

You’ve encountered whipsawing before in the Data Basics tutorial. What does whipsawing indicate?

  • There is a lot of volatility in the data. ✗
  • The graph should be plotted in polar coordinates. ✗
  • The data contains rounding errors. ✗
  • We are trying to plot more than one line with a single line. ✓
Correct!

As a result, our single line needs to connect multiple points for each x value before moving to the next x value.
ggplot(asia) +
  geom_line(mapping = aes(x = year, y = gdpPercap))

"Good job! There are actually four lines in the plot. One for each country: China, Japan, North Korea, and South Korea."

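You can confirm this diagnosis without a plot by counting the rows that share each x value. This is a minimal sketch that assumes the gapminder and dplyr packages are installed; it rebuilds asia as shown earlier, and rows_per_year is a name introduced here for illustration.

```r
library(gapminder)
library(dplyr)

# Rebuild the asia data set from the earlier section.
asia <- gapminder %>%
  filter(country %in% c("China", "Japan", "Korea, Dem. Rep.", "Korea, Rep.")) %>%
  mutate(country = case_when(
    country == "Korea, Dem. Rep." ~ "North Korea",
    country == "Korea, Rep." ~ "South Korea",
    TRUE ~ as.character(country)
  ))

# Each year holds one row per country, so a single line has to
# visit several y values at every x value.
rows_per_year <- count(asia, year)
rows_per_year
```

More than one row per x value is a sign that the data holds several lines.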
3.6.2.5 group

Many geoms, like lines, boxplots, and smooth lines, use a single object to display the entire dataset. You can use the group aesthetic to instruct these geoms to draw separate objects for different groups of observations.

For example, in the code below, you can map group to the grouping variable country to create a separate line for each country. Try it. Be sure to place the group mapping inside of aes().

ggplot(asia) +
  geom_line(mapping = aes(x = year, y = gdpPercap))

ggplot(asia) +
  geom_line(mapping = aes(x = year, y = gdpPercap, group = country))

"Good job! We now have a separate line for each country. Unfortunately, we cannot tell what the countries are: the group aesthetic does not supply a legend. Let's look at how to fix that."

3.6.2.6 aesthetics

You do not have to rely on the group aesthetic to perform a grouping. ggplot2 will automatically group a monolithic geom whenever you map an aesthetic to a categorical variable.

For example, the code below performs an implied grouping. And since we use the color aesthetic, the plot includes a color legend.

ggplot(asia) +
  geom_line(mapping = aes(x = year, y = gdpPercap, color = country))
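
If you would like to see the implied grouping for yourself, one option is to inspect the data that ggplot2 computes for the layer. The sketch below assumes the gapminder and dplyr packages are installed and uses ggplot2's layer_data() helper; the country names are left as they appear in gapminder to keep the example short.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

asia <- gapminder %>%
  filter(country %in% c("China", "Japan", "Korea, Dem. Rep.", "Korea, Rep.")) %>%
  mutate(country = as.character(country))

p <- ggplot(asia) +
  geom_line(mapping = aes(x = year, y = gdpPercap, color = country))

# layer_data() returns the data ggplot2 computed for a layer,
# including the group it assigned to each row.
built <- layer_data(p, 1)
n_groups <- length(unique(built$group))
n_groups
```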

#### linetype

Lines recognize a useful aesthetic that we haven’t encountered before, linetype. Change color to linetype below and inspect the results. What happens if you map both a color and a linetype to country?

ggplot(asia) +
  geom_line(mapping = aes(x = year, y = gdpPercap, color = country))

ggplot(asia) +
  geom_line(mapping = aes(x = year, y = gdpPercap, color = country, linetype = country))

"Good job! If you map two aesthetics to the same variable, ggplot2 will combine their legends. Supplementing color with linetype is a good idea if you might print your line chart in black and white."

3.6.2.7 Exercise 1 - Life Expectancy

Use what you’ve learned to plot the life expectancy of each country over time. Life expectancy is saved in the asia data set as lifeExp. Which country has the highest life expectancy? The lowest?

ggplot(asia) +
  geom_line(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))

3.6.3 Similar geoms

3.6.3.1 geom_step()

geom_step() draws a line chart in a stepwise fashion. To see what I mean, change the geom in the plot below and rerun the code.

ggplot(asia) +
  geom_line(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))

ggplot(asia) +
  geom_step(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))

'Good job! You can control whether the steps move horizontally first and then vertically or vertically first and then horizontally with the parameters `direction = "hv"` (the default) or `direction = "vh"`.'
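
As a small illustration of the direction parameter, the sketch below draws the last two years of the economics dataset with vertical-first steps. The subset and the names recent and p_vh are choices made here for illustration; a short series simply makes the individual steps easier to see.

```r
library(ggplot2)

# Keep the last 24 monthly observations so each step is visible.
recent <- tail(economics, 24)

# direction = "vh" moves vertically first, then horizontally.
p_vh <- ggplot(recent) +
  geom_step(mapping = aes(x = date, y = unemploy), direction = "vh")

p_vh
```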

3.6.3.2 geom_area()

geom_area() is similar to a line graph, but it fills in the area under the line. To see geom_area() in action, change the geom in the plot below and rerun the code.

ggplot(economics) +
  geom_line(mapping = aes(x = date, y = unemploy))

ggplot(economics) +
  geom_area(mapping = aes(x = date, y = unemploy))

3.6.3.3 Review 2 - Set vs. Map

Do you recall from Visualization Basics how you would set the fill of our plot to blue (instead of, say, map the fill to a variable)? Give it a try.

ggplot(economics) +
  geom_area(mapping = aes(x = date, y = unemploy))

ggplot(economics) +
  geom_area(mapping = aes(x = date, y = unemploy), fill = "blue")

3.6.3.4 Accumulation

geom_area() is a great choice if your measurements represent the accumulation of objects (like unemployed people). Notice that the filled area drawn by geom_area() always begins or ends at zero on the y axis.

Perhaps because of this, geom_area() can be quirky when you have multiple groups. Run the code below. Can you tell what happens here?

ggplot(asia) +
  geom_area(mapping = aes(x = year, y = lifeExp, fill = country))

3.6.3.5 Review 3 - Position adjustments

If you answered that people in China were living to be 300 years old, you guessed wrong.

geom_area() is stacking each group above the group below. As a result, the line that should display the life expectancy for China displays the combined life expectancy for all countries.
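
The stacking can be checked with a little arithmetic. The sketch below assumes the gapminder and dplyr packages are installed; it compares the sum of lifeExp within each year (the top of the stack) with the largest single value for that year. The column names top_of_stack and highest_country are introduced here for illustration.

```r
library(gapminder)
library(dplyr)

# The same four countries as asia; names are left as in gapminder.
asia <- filter(gapminder, country %in% c("China", "Japan", "Korea, Dem. Rep.", "Korea, Rep."))

# With position = "stack", the top edge of the uppermost area sits at the
# sum of all groups for that year, not at any single country's value.
stacked <- asia %>%
  group_by(year) %>%
  summarise(top_of_stack = sum(lifeExp), highest_country = max(lifeExp))

stacked
```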

You can fix this by changing the position adjustment for geom_area(). Give it a try below. Change the position parameter from “stack” (the implied default) to “identity”. See Bar Charts if you’d like to learn more about position adjustments.

ggplot(asia) +
  geom_area(mapping = aes(x = year, y = lifeExp, fill = country), alpha = 0.3)

ggplot(asia) +
  geom_area(mapping = aes(x = year, y = lifeExp, fill = country), position = "identity", alpha = 0.3)

"Good Job! You can further customize your graph by switching from `geom_area()` to `geom_ribbon()`. `geom_ribbon()` lets you map the bottom of the filled area to a variable, as well as the top. See `?geom_ribbon` if you'd like to learn more."

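Here is a minimal sketch of geom_ribbon(), which takes ymin and ymax aesthetics for the bottom and top of the filled area. The band of 10% above and below unemploy is an arbitrary choice made only to have something to draw; it is not a statistical interval.

```r
library(ggplot2)

# geom_ribbon() needs ymin and ymax; the 10% margins are arbitrary.
p_ribbon <- ggplot(economics) +
  geom_ribbon(
    mapping = aes(x = date, ymin = 0.9 * unemploy, ymax = 1.1 * unemploy),
    fill = "blue", alpha = 0.3
  ) +
  geom_line(mapping = aes(x = date, y = unemploy))

p_ribbon
```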
3.6.3.6 geom_path()

geom_line() comes with a strange bedfellow, geom_path(). geom_path() draws a line between points like geom_line(), but instead of connecting points in the order that they appear along the x axis, geom_path() connects the points in the order that they appear in the data set.

It starts with the observation in row one of the data and connects it to the observation in row two, which it then connects to the observation in row three, and so on.

3.6.3.7 geom_path() example

To see how geom_path() does this, let’s rearrange the rows in the economics dataset. We can reorder them by unemploy value. Now the data set will begin with the observation that had the lowest value of unemploy.

economics2 <- economics %>% 
  arrange(unemploy)
economics2
## # A tibble: 574 × 6
##    date         pce    pop psavert uempmed unemploy
##    <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
##  1 1968-12-01  576. 201621    11.1     4.4     2685
##  2 1968-09-01  568. 201095    10.6     4.6     2686
##  3 1968-10-01  572. 201290    10.8     4.8     2689
##  4 1969-02-01  589. 201881     9.7     4.9     2692
##  5 1968-04-01  544  200208    12.3     4.6     2709
##  6 1969-03-01  589. 202023    10.2     4       2712
##  7 1969-05-01  600. 202331    10.1     4.2     2713
##  8 1968-11-01  577. 201466    10.6     4.4     2715
##  9 1969-01-01  584. 201760    10.3     4.4     2718
## 10 1968-05-01  550. 200361    12       4.4     2740
## # … with 564 more rows

3.6.3.8 geom_path() example continued

If we plot the reordered data with both geom_line() and geom_path() we get two very different graphs.

ggplot(economics2) +
  geom_line(mapping = aes(x = date, y = unemploy))

ggplot(economics2) +
  geom_path(mapping = aes(x = date, y = unemploy))

The plot on the left uses geom_line(), hence the points are connected in order along the x axis. The plot on the right uses geom_path(). These points are connected in the order that they appear in the dataset, which happens to put them in order along the y axis.
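
It follows that the two geoms agree whenever the rows are already sorted by the x variable. The sketch below, which assumes ggplot2 and dplyr are installed, sorts the reordered data by date before handing it to geom_path(); economics3 and p_path are names introduced here for illustration.

```r
library(ggplot2)
library(dplyr)

economics2 <- economics %>%
  arrange(unemploy)

# Sorting by the x variable before plotting makes geom_path()
# visit the points in the same order that geom_line() would.
economics3 <- economics2 %>%
  arrange(date)

p_path <- ggplot(economics3) +
  geom_path(mapping = aes(x = date, y = unemploy))

p_path
```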

3.6.3.9 A use case

Why would you want to use geom_path()? The code below illustrates one particularly useful case. The tx dataset contains latitude and longitude coordinates saved in a specific order.

library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
tx <- map_data("state", region = "texas")
tx
## # A tibble: 1,088 × 6
##     long   lat group order region subregion
##    <dbl> <dbl> <dbl> <int> <chr>  <chr>    
##  1 -94.5  33.7     1     1 texas  <NA>     
##  2 -94.5  33.7     1     2 texas  <NA>     
##  3 -94.5  33.6     1     3 texas  <NA>     
##  4 -94.5  33.6     1     4 texas  <NA>     
##  5 -94.5  33.6     1     5 texas  <NA>     
##  6 -94.4  33.6     1     6 texas  <NA>     
##  7 -94.4  33.6     1     7 texas  <NA>     
##  8 -94.4  33.6     1     8 texas  <NA>     
##  9 -94.4  33.6     1     9 texas  <NA>     
## 10 -94.3  33.6     1    10 texas  <NA>     
## # … with 1,078 more rows
ggplot(tx) +
  geom_path(mapping = aes(x = long, y = lat))

image

"Good job! `geom_path()` reveals how you can use what is essentially a line plot to make a map (this is a map of the state of Texas). There are other ways to make maps in R, but this low tech method is surprisingly versatile."

3.6.3.10 geom_polygon()

geom_polygon() extends geom_path() one step further: it connects the last point to the first and then colors the interior region with a fill. The result is a polygon.

ggplot(tx) +
  geom_polygon(mapping = aes(x = long, y = lat))

image

3.6.3.11 Exercise 2 - Shattered Glass

What do you think went wrong in the plot of Texas below?

image

What went wrong?

  • The rows in the dataset became out of order. ✓
  • The programmer did not set a fill aesthetic. ✗
  • The programmer used a line plot instead of a polygon plot. ✗
Correct!

It looks like someone messed with tx. tx and datasets like it will have an order variable that you can use to ensure that the data is in the correct order before you plot it.
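
The repair can be sketched as follows. The code assumes ggplot2, dplyr, and maps are installed, and it scrambles the rows itself to imitate the damage, since the dataset behind the plot above is not shown; tx_scrambled and tx_fixed are names introduced here.

```r
library(ggplot2)  # map_data() also needs the maps package to be installed
library(dplyr)

tx <- map_data("state", region = "texas")

# Scramble the rows to imitate a damaged dataset.
set.seed(1)
tx_scrambled <- tx[sample(nrow(tx)), ]

# Sorting by the order column restores the outline.
tx_fixed <- arrange(tx_scrambled, order)

p_tx <- ggplot(tx_fixed) +
  geom_polygon(mapping = aes(x = long, y = lat))

p_tx
```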

3.6.4 Maps

3.6.4.1 maps

The tx data set comes from the maps package, which is an R package that contains similarly formatted data sets for many regions of the globe.

A short list of the datasets saved in maps includes: france, italy, nz, usa, world, and world2, along with county and state. These last two map the US at the county and state levels. To learn more about maps, run help(package = maps).

3.6.4.2 map_data

You do not need to load the maps package yourself to use its data. ggplot2 provides the function map_data(), which fetches maps from the maps package and returns them in a format that ggplot2 can plot. The maps package does need to be installed for map_data() to work.

3.6.4.3 map_data syntax

To use map_data(), give it the name of a dataset to retrieve. You can retrieve a subset of the data by providing an optional region argument. For example, I can use this code to retrieve a map of Florida from state, which is the dataset that contains the contiguous US states (Alaska and Hawaii are not included).

fl <- map_data("state", region = "florida")
ggplot(fl) +
  geom_polygon(mapping = aes(x = long, y = lat))

Alter the code to retrieve and plot your home state (try Idaho if you are outside of the US). Notice the capitalization: region names are written in lower case.

library(maps)
id <- map_data("state", region = "idaho")
ggplot(id) +
  geom_polygon(mapping = aes(x = long, y = lat))

3.6.4.4 state

If you do not specify a region, map_data() will retrieve the entire data set, in this case state.

us <- map_data("state")

In practice, you will often have to retrieve an entire dataset at least once to learn what region names to use with map_data(). The names will be stored in the region column of the dataset.
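
A minimal sketch of that lookup, assuming ggplot2 and maps are installed (region_names is a name introduced here for illustration):

```r
library(ggplot2)  # map_data() also needs the maps package to be installed

us <- map_data("state")

# The distinct values of the region column are the names you can
# pass to the region argument of map_data().
region_names <- sort(unique(us$region))
head(region_names)
```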

3.6.4.5 Hmmm

The code below retrieves and plots the entire state data set, but something goes wrong. What?

us <- map_data("state")
ggplot(us) +
  geom_polygon(mapping = aes(x = long, y = lat))

3.6.4.6 Multiple polygons

In this case, our data is not out of order, but it contains more than one polygon: it contains at least one polygon for each state in the dataset.

By default, geom_polygon() tries to plot a single polygon, which causes it to connect multiple polygons in weird ways.

  • Example of a map of the world.
map("world", fill = TRUE, col = 1:10, wrap = c(-180, 180))

3.6.4.7 groups

Which aesthetic can you use to plot multiple polygons? In the code below, map the aesthetic to the group variable in the state dataset. This variable contains all of the grouping information needed to make a coherent map. Then rerun the code.

ggplot(us) +
  geom_polygon(mapping = aes(x = long, y = lat))

ggplot(us) +
  geom_polygon(mapping = aes(x = long, y = lat, group = group))
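
To see why the group column, rather than region, carries the grouping information, you can count the polygons stored for each region. This sketch assumes ggplot2, dplyr, and maps are installed; a region may be stored as more than one polygon, and the count below shows which ones are. The names polygons_per_region and n_polygons are introduced here.

```r
library(ggplot2)
library(dplyr)

us <- map_data("state")

# Count the distinct polygons (group values) that make up each region.
polygons_per_region <- us %>%
  distinct(region, group) %>%
  count(region, name = "n_polygons") %>%
  arrange(desc(n_polygons))

head(polygons_per_region)
```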

3.6.4.8 USArrests

R comes with a data set named USArrests that we can use in conjunction with our plot above to make a choropleth map. A choropleth map uses the color of each region in the plot to display some value associated with the region.

In our case we will use the UrbanPop variable of USArrests, which records how urbanized each state was in 1973. UrbanPop is the percent of the population who lived within a city.

USArrests
## # A tibble: 50 × 4
##    Murder Assault UrbanPop  Rape
##     <dbl>   <int>    <int> <dbl>
##  1   13.2     236       58  21.2
##  2   10       263       48  44.5
##  3    8.1     294       80  31  
##  4    8.8     190       50  19.5
##  5    9       276       91  40.6
##  6    7.9     204       78  38.7
##  7    3.3     110       77  11.1
##  8    5.9     238       72  15.8
##  9   15.4     335       80  31.9
## 10   17.4     211       60  25.8
## # … with 40 more rows

3.6.4.9 geom_map()

You can use geom_map() to create choropleth maps. geom_map() pairs a data frame like USArrests with a map dataset like us by matching region names.

3.6.4.10 Data wrangling

To use geom_map(), we first need to ensure that a common set of region names appears across both datasets.

At the moment, this isn’t the case. USArrests uses capitalized state names and hides them outside of the dataset in the row names (instead of in a column). In contrast, us uses a column of lower case state names. The code below fixes this.

USArrests2 <- USArrests %>% 
  rownames_to_column("region") %>% 
  mutate(region = tolower(region))

USArrests2
## # A tibble: 50 × 5
##    region      Murder Assault UrbanPop  Rape
##    <chr>        <dbl>   <int>    <int> <dbl>
##  1 alabama       13.2     236       58  21.2
##  2 alaska        10       263       48  44.5
##  3 arizona        8.1     294       80  31  
##  4 arkansas       8.8     190       50  19.5
##  5 california     9       276       91  40.6
##  6 colorado       7.9     204       78  38.7
##  7 connecticut    3.3     110       77  11.1
##  8 delaware       5.9     238       72  15.8
##  9 florida       15.4     335       80  31.9
## 10 georgia       17.4     211       60  25.8
## # … with 40 more rows
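
A quick way to confirm that the names now line up is to compare the two region columns with setdiff(). The sketch assumes ggplot2, dplyr, tibble, and maps are installed; missing_from_map and missing_from_data are names introduced here for illustration.

```r
library(ggplot2)
library(dplyr)
library(tibble)

us <- map_data("state")
USArrests2 <- USArrests %>%
  rownames_to_column("region") %>%
  mutate(region = tolower(region))

# States in the data that have no matching region in the map.
missing_from_map <- setdiff(USArrests2$region, us$region)

# Regions in the map that have no matching row in the data.
missing_from_data <- setdiff(unique(us$region), USArrests2$region)

missing_from_map
missing_from_data
```

Any name returned by the first call is a row of USArrests2 for which the map dataset has no polygon.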

3.6.4.11 geom_map() syntax

To use geom_map():

  1. Initialize a plot with the data set that contains your data. Here that is USArrests2.

  2. Add geom_map(). Set the map_id aesthetic to the variable that contains the region names. Then set the fill aesthetic to the fill variable. You do not need to supply x and y aesthetics; geom_map() will derive these values from the map data set, which you must set with the map parameter. Since map is a parameter, it should go outside the aes() function.

  3. Follow geom_map() with expand_limits(), and tell expand_limits() what the x and y variables in the map dataset are. This shouldn’t be necessary in future iterations of geom_map(), but for now ggplot2 will use the x and y arguments of expand_limits() to build the bounding box for your plot.

ggplot(USArrests2) +
  geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
  expand_limits(x = us$long, y = us$lat)

“Congratulations! You’ve used geom_map() to make your first choropleth plot! To test your understanding, alter the code to display the Murder, Assault, or Rape variables.”
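
As one possible answer to that suggestion, the sketch below swaps the fill variable to Murder and leaves everything else unchanged. It assumes ggplot2, dplyr, tibble, and maps are installed, and it rebuilds us and USArrests2 so that it can run on its own; p_murder is a name introduced here.

```r
library(ggplot2)
library(dplyr)
library(tibble)

us <- map_data("state")
USArrests2 <- USArrests %>%
  rownames_to_column("region") %>%
  mutate(region = tolower(region))

# Same template as before; only the fill variable changes.
p_murder <- ggplot(USArrests2) +
  geom_map(aes(map_id = region, fill = Murder), map = us) +
  expand_limits(x = us$long, y = us$lat)

p_murder
```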

3.6.4.12 coord_map()

You may have noticed that our maps look a little off. So far, we’ve plotted them in Cartesian coordinates, which distort the spherical surface described by latitude and longitude. Also, ggplot2 adjusts the aspect ratio of our plots to fit our graphing window, which can further distort our maps.

You can avoid both of these distortions by adding coord_map() to your plot. coord_map() displays the plot in a fixed cartographic projection. Note that coord_map() relies on the mapproj package, so you’ll need to have mapproj installed before you use coord_map().

library(mapproj)

ggplot(USArrests2) +
  geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
  expand_limits(x = us$long, y = us$lat) +
  coord_map()

3.6.4.13 projections

By default, coord_map() replaces the coordinate system with a Mercator projection. To use a different projection, set the projection argument of coord_map() to a projection name, surrounded by quotation marks.

To see this, extend the code below to view the map in a “sinusoidal” projection.

ggplot(USArrests2) +
  geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
  expand_limits(x = us$long, y = us$lat)

ggplot(USArrests2) +
  geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
  expand_limits(x = us$long, y = us$lat) +
  coord_map(projection = "sinusoidal")
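
Some projections need extra parameters, which coord_map() passes along to mapproj. The sketch below uses the "conic" projection with a reference latitude of 30, a combination taken from the ggplot2 documentation examples; treat the specific value as an arbitrary choice. It assumes ggplot2, dplyr, tibble, maps, and mapproj are installed.

```r
library(ggplot2)
library(dplyr)
library(tibble)
library(mapproj)  # required by coord_map()

us <- map_data("state")
USArrests2 <- USArrests %>%
  rownames_to_column("region") %>%
  mutate(region = tolower(region))

# "conic" takes a reference latitude, supplied here as lat0.
p_conic <- ggplot(USArrests2) +
  geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
  expand_limits(x = us$long, y = us$lat) +
  coord_map(projection = "conic", lat0 = 30)

p_conic
```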

3.6.4.14 Recap

You can now make all of the plots recommended in the Exploratory Data Analysis tutorial. The next tutorial in this primer will teach you several strategies for dealing with overplotting, a problem that can occur when you have large data or low resolution data.

3.7 Overplotting and Big Data

Here you will learn to handle a problem that occurs when graphing data, especially large data. Along the way, you will meet several new geoms.

3.7.1 Welcome

Data visualization is a useful tool because it makes data accessible to your visual system, which can process large amounts of information quickly. However, two characteristics of data can short circuit this system. Data cannot be easily visualized if:

  1. Data points are all rounded to the same values.
  2. The data contains so many points that they occlude each other.

These features both create overplotting, the condition where multiple geoms in the plot are plotted on top of each other, hiding each other. This tutorial will show you several strategies for dealing with overplotting, introducing new geoms along the way.

The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

The tutorial uses the ggplot2 and hexbin packages, which have been pre-loaded for your convenience.

3.7.2 Overplotting

3.7.2.1 What is overplotting?

You’ve seen this plot several times in previous tutorials, but have you noticed that it only displays 126 points? This is unusual because the plot visualizes a data set that contains 234 points.

image

The missing points are hidden behind other points, a phenomenon known as overplotting. Overplotting is a problem because it provides an incomplete picture of the dataset. You cannot determine where the mass of the points fall, which makes it difficult to spot relationships in the data.
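
You can count the hidden points yourself. The sketch below assumes that the plot in question maps displ to the x axis and hwy to the y axis of the mpg dataset, as the exercises later in this tutorial do; it compares the number of rows with the number of distinct plotting locations. It needs ggplot2 and dplyr.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

n_rows <- nrow(mpg)

# Rows that share both displ and hwy land on exactly the same spot.
n_locations <- mpg %>%
  distinct(displ, hwy) %>%
  nrow()

c(rows = n_rows, distinct_locations = n_locations)
```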

3.7.2.2 Causes of overplotting

Overplotting usually occurs for two different reasons:

  1. The data points have been rounded to a “grid” of common values, as in the plot above
  2. The dataset is so large that it cannot be plotted without points overlapping each other

How you deal with overplotting will depend on the cause.

3.7.3 Rounding

3.7.3.1 Overplotting due to rounding

If your overplotting is due to rounding, you can obtain a better picture of the data by making each point semi-transparent. For example you could set the alpha aesthetic of the plot below to a value less than one, which will make the points transparent.

Try this now. Set the points to an alpha of 0.25, which will make each point 25% opaque, so locations where several points are stacked on top of each other will appear darker than locations with a single point.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), alpha = 0.25)

"Good job! You can now identify which values contain more observations. The darker locations contain several points stacked on top of each other."

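As a rough guide to how quickly overlapping points darken, the sketch below computes the combined opacity of n stacked points. It assumes the usual compositing rule, in which each new point covers a fraction alpha of whatever still shows through; graphics devices may differ in detail, and stacked_opacity() is a helper written here for illustration, not a ggplot2 function.

```r
# Combined opacity of n overlapping points that each have the given alpha,
# assuming each point covers `alpha` of whatever still shows through.
stacked_opacity <- function(n, alpha = 0.25) {
  1 - (1 - alpha)^n
}

# Opacity for one to eight overlapping points.
round(stacked_opacity(1:8), 2)
```

Under this assumption the opacity rises quickly for the first few points and then approaches full opacity only gradually, which is why lower alpha values work better when many points overlap.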
3.7.3.2 Adjust the position

A second strategy for dealing with rounding is to adjust the position of each point. position = “jitter” adds a small amount of random noise to the location of each point. Since the noise is random, it is unlikely that two points rounded to the same location will also be jittered to the same location.

The result is a jittered plot that displays more of the data. Jittering comes with both limitations and benefits. You cannot use a jittered plot to see the local values of the points, but you can use a jittered plot to perceive the global relationship between the variables, something that is hard to do in the presence of overplotting.

image

#### Review - jitter

In the Scatterplots tutorial, you learned of a geom that displays the equivalent of geom_point() with a position = “jitter” adjustment.

Rewrite the code below to use that geom. Do you obtain similar results?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))

'Good job! `geom_jitter()` is shorthand for `geom_point(position = "jitter")`, so the two plots look similar. The exact positions differ from run to run because the noise is random.'

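If you want more control over the noise, ggplot2's position_jitter() lets you set the amount of jitter in each direction and fix the random seed. The values below are arbitrary choices for illustration.

```r
library(ggplot2)

# width and height set the amount of jitter in each direction (in data
# units); seed makes the random noise reproducible.
p_jitter <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.05, height = 0.5, seed = 123)
  )

p_jitter
```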
3.7.4 Large Data

3.7.4.1 Overplotting due to large data

A dataset does not need to be truly “Big Data” to be hard to visualize. The diamonds data set contains fewer than 54,000 points, but it still suffers from overplotting when you try to plot carat vs. price. Here the bulk of the points fall on top of each other in an impenetrable cloud of blackness.

image

3.7.4.2 Strategies for large data

Alpha and jittering are less useful for large data. Jittering will not separate the points, and a mass of transparent points can still look black.

A better way to deal with overplotting due to large data is to visualize a summary of the data. In fact, we’ve already worked with this dataset by using geoms that naturally summarise the data, like geom_histogram() and geom_smooth().

image image

Let’s look at several other geoms that you can use to summarise relationships in large data.

3.7.4.3 Review - Boxplots with continuous variables

Boxplots efficiently summarise data, which makes them a useful tool for large data sets. In the Boxplots tutorial, you learned how to use cut_width() and the group aesthetic to plot multiple boxplots for a continuous variable.

Modify the code below to cut the carat axis into intervals with width 0.2. Then set the group aesthetic of geom_boxplot() to the result.

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = carat, y = price))
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = carat, y = price, group = cut_width(carat, width = 0.2)))

"Good job! The medians of the boxplots give a somewhat more precise description of the relationship between carat and price than does the fan of individual points."

3.7.4.4 geom_bin2d()

geom_bin2d() provides a new way to summarise two dimensional continuous relationships. You can think of bin2d as working like a three dimensional histogram. It divides the Cartesian field into small rectangular bins, like a checkerboard. It then counts how many points fall into each bin, and maps the count to color. Bins that contain no points are left blank.

image

By studying the results, we can see that the mass of points falls in the bottom left of the graph.

3.7.4.5 Exercise - binwidths

Like histograms, bin2d uses bins and binwidth arguments. Each should be set to a vector of two numbers: one for the number of bins (or the binwidth) to use on the x axis, and one for the number of bins (or the binwidth) to use on the y axis.

Use one of these parameters to modify the graph below to use 40 bins on the x axis and 50 on the y axis.

ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price))

ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price), bins = c(40, 50))

"Good job! As with histograms, bin2ds can reveal different information at different binwidths."

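For comparison, here is a sketch that uses binwidth instead of bins. Binwidths are given in the units of each axis, so the vector below asks for bins that are 0.1 carats wide and 500 dollars tall; both numbers are arbitrary choices.

```r
library(ggplot2)

# binwidth is in data units: c(width on the x axis, height on the y axis).
p_bins <- ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price), binwidth = c(0.1, 500))

p_bins
```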
3.7.4.6 Exercise - geom_hex()

Our eyes are drawn to straight vertical and horizontal lines, which makes it easy to perceive “edges” in a bin2d that are not necessarily there (the rectangular bins naturally form edges that span the breadth of the graph).

One way to avoid this, if you like, is to use geom_hex(). geom_hex() functions like geom_bin2d() but uses hexagonal bins. Adjust the graph below to use geom_hex().

ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price))

ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price))

3.7.4.7 geom_density2d()

geom_density2d() provides one last way to summarize a two dimensional continuous relationship. Think of density2d as the two dimensional analog of density. Instead of drawing a line that rises and falls on the y dimension, it draws a field over the coordinate axes that rises and falls on the z dimension, the dimension that points straight out of the graph towards you.

The result is similar to a mountain that you are looking straight down upon. The high places on the mountain show where the most points fall and the low places show where the fewest points fall. To visualize this mountain, density2d draws contour lines that connect areas with the same “height”, just like a contour map draws elevation.

Here we see the “ridge” of points that occur at low values of carat and price.

image

3.7.4.8 Expand limits

By default, density2d zooms in on the region that contains density lines. This may not be the same region spanned by the data points. If you like, you can re-expand the graph to the region spanned by the price and carat variables with expand_limits().

expand_limits() zooms the x and y axes to fit the range of any two variables (they need not be the original x and y variables).
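
A minimal sketch of this pattern with the diamonds data is shown here; it assumes ggplot2 is installed, and p_density is a name introduced for illustration.

```r
library(ggplot2)

p_density <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_density_2d() +
  # Stretch the axes to cover the full range of both variables.
  expand_limits(x = range(diamonds$carat), y = range(diamonds$price))

p_density
```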

image

#### Exercise - density2d

Often density2d plots are easiest to read when you plot them on top of the original data. In the chunk below create a plot of diamond carat size vs. price. The plot should contain density2d lines superimposed on top of the raw points. Make the raw points transparent with an alpha of 0.1.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) + 
  geom_density_2d()

"Good job! Plotting a summary on top of raw values is a common pattern in data science."

3.7.4.9 Recap

Overplotting is a common phenomenon in plots because the causes of overplotting are common phenomena in data sets. Data sets often

  • round values to a common set of values, or
  • are too big to visualize easily without overplotting

When overplotting results from rounding, you can work around it by manipulating the transparency or location of the points.

For larger datasets you can use geoms that summarise the data to display relationships without overplotting. This is an effective tactic for truly big data as well, and it also works for the first case of overplotting due to rounding.

One final tactic is to sample your data to create a sample data set that is small enough to visualize without overplotting.
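
Sampling can be sketched with dplyr's slice_sample(). The sample size of 5,000 and the seed are arbitrary choices made here, and diamonds_sample is a name introduced for illustration.

```r
library(ggplot2)
library(dplyr)

set.seed(2023)  # arbitrary seed, only to make the sample reproducible

# Keep a random 5,000 rows; the sample size is an arbitrary choice.
diamonds_sample <- slice_sample(diamonds, n = 5000)

p_sample <- ggplot(data = diamonds_sample) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 0.25)

p_sample
```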

You’ve now learned a complete toolkit for exploring data visually. The final tutorial in this primer will show you how to polish the plots you make for publication. Instead of learning how to visualize data, you will learn how to add titles and captions, customize color schemes and more.

3.8 Customize Your Plots

Learn to adjust color schemes, titles, legends, and more to make your plots perfect for publication.

3.8.1 Welcome

This tutorial will teach you how to customize the look and feel of your plots. You will learn how to:

  • Zoom in on areas of interest
  • Add labels and annotations to your plots
  • Change the appearance of your plot with a theme
  • Use scales to select custom color palettes
  • Modify the labels, title, and position of legends

The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.

The tutorial uses the ggplot2, dplyr, scales, ggthemes, and viridis packages, which have been pre-loaded for your convenience.

3.8.2 Zooming

In the previous tutorials, you learned how to visualize data with graphs. Now let’s look at how to customize the look and feel of your graphs. To do that we will need to begin with a graph that we can customize.

3.8.2.1 Review 1 - Make a plot

In the chunk below, make a plot that uses boxplots to display the relationship between the cut and price variables from the diamonds dataset.

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))

3.8.2.2 Storing plots

Since we want to use this plot again later, let’s go ahead and save it.

p <- ggplot(diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))

Now whenever you call p, R will draw your plot. Try it and see.

p

"Good job! By the way, have you taken a moment to look at what the plot shows? Let's do that now."

3.8.2.3 Surprise?

Our plot shows something surprising: when you group diamonds by cut, the worst cut diamonds have the highest median price. It’s a little hard to see in the plot, but you can verify it with some data manipulation.

diamonds %>% 
  group_by(cut) %>% 
  summarise(median = median(price))
## # A tibble: 5 × 2
##   cut       median
##   <ord>      <dbl>
## 1 Fair       3282 
## 2 Good       3050.
## 3 Very Good  2648 
## 4 Premium    3185 
## 5 Ideal      1810

3.8.2.4 Zoom

The difference between median prices is hard to see in our plot because each group contains distant outliers.

We can make the difference easier to see by zooming in on the low values of y, where the medians are located. There are two ways to zoom with ggplot2: with and without clipping.

3.8.2.5 Clipping

Clipping refers to how R should treat the data that falls outside of the zoomed region. To see its effect, look at these plots. Each zooms in on the region where price is between $0 and $7,500.

image image

  • The plot on the left zooms by clipping. It removes all of the data points that fall outside of the desired region, and then plots the data points that remain.
  • The plot on the right zooms without clipping. You can think of it as drawing the entire graph and then zooming into a certain region.

3.8.2.6 xlim() and ylim()

Of these, zooming by clipping is the easier to do. To zoom your graph on the x axis, add the function xlim() to the plot call. To zoom on the y axis, add the function ylim(). Each takes a minimum value and a maximum value to zoom to, like this:

some_plot +
  xlim(0, 100)

3.8.2.7 Exercise 1 - Clipping

Use ylim() to recreate our plot on the left from above. The plot zooms the y axis from 0 to 7,500 by clipping.

p

p + ylim(0, 7500)
## Warning: Removed 8382 rows containing non-finite values (`stat_boxplot()`).

3.8.2.8 A caution

Zooming by clipping is a bad idea for boxplots. ylim() fundamentally changes the information conveyed in the boxplots because it throws out some of the data before drawing the boxplots. Those aren’t the medians of the entire data set that we are looking at.
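
To see how much clipping can distort the picture, here is a small sketch that compares each cut's median price on the full data with the median of only the prices that survive ylim(0, 7500). The column names median_all and median_clipped are my own labels, not part of the tutorial.

```r
library(ggplot2)
library(dplyr)

# Compare each cut's median price before and after discarding prices above
# 7500, which mimics what ylim(0, 7500) does before the boxplots are computed
medians <- diamonds %>% 
  group_by(cut) %>% 
  summarise(median_all = median(price),
            median_clipped = median(price[price <= 7500]))

medians
```

Wherever the two columns differ, the clipped boxplot is drawing a median that does not describe the whole group.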

How then can we zoom without clipping?

3.8.2.9 xlim and ylim

To zoom without clipping, set the xlim and/or ylim arguments of your plot’s coord_ function. Each takes a numeric vector of length two (the minimum and maximum values to zoom to).

This is easy to do if your plot explicitly calls a coord_ function

p + coord_flip(ylim = c(0, 7500))

3.8.2.10 coord_cartesian()

But what if your plot doesn’t call a coord_ function? Then your plot is using Cartesian coordinates (the default). You can adjust the limits of your plot without changing the default coordinate system by adding coord_cartesian() to your plot.

Try it below. Use coord_cartesian() to zoom p to the region where price falls between 0 and 7500.

p + coord_cartesian(ylim = c(0, 7500))

"Good job! Now it is much easier to see the differences in the median."

3.8.2.11 p

Notice that our code so far has used p to make a plot, but it hasn’t changed the plot that is saved inside of p. You can run p by itself to get the unzoomed plot.

p

3.8.2.12 Updating p

I like the zooming, so I’m purposefully going to overwrite the plot stored in p so that it uses it.

p <- p + coord_cartesian(ylim = c(0, 7500))
p

3.8.3 Labels

3.8.3.1 labs()

The relationship in our plot is now easier to see, but that doesn’t mean that everyone who sees our plot will spot it. We can draw their attention to the relationship with a label, like a title or a caption.

To do this, we will use the labs() function. You can think of labs() as an all purpose function for adding labels to a ggplot2 plot.

3.8.3.2 Titles

Give labs() a title argument to add a title.

p + labs(title = "The title appears here")

3.8.3.3 Subtitles

Give labs() a subtitle argument to add a subtitle. If you use multiple arguments, remember to separate them with a comma.

p + labs(title = "The title appears here",
         subtitle = "The subtitle appears here, slightly smaller")

3.8.3.4 Captions

Give labs() a caption argument to add a caption. I like to use captions to cite my data source.

p + labs(title = "The title appears here",
         subtitle = "The subtitle appears here, slightly smaller",
         caption = "Captions appear at the bottom.")

3.8.3.5 Exercise 2 - Labels

Plot p with a set of informative labels. For learning purposes, be sure to use a title, subtitle, and caption.

p + labs(title = "Diamond prices by cut",
         subtitle = "Fair cut diamonds fetch the highest median price. Why?",
         caption = "Data collected by Hadley Wickham")

"Good job! By the way, why *do* fair cut diamonds fetch the highest price?"

3.8.3.6 Exercise 3 - Carat size?

Perhaps a diamond’s cut is confounded with its carat size. If fair cut diamonds tend to be larger diamonds, that would explain their higher prices. Let’s test this.

Make a plot that displays the relationship between carat size, price, and cut for all diamonds. How do you interpret the results? Give your plot a title, subtitle, and caption that explain the plot and convey your conclusions.

If you are looking for a way to start, I recommend using a smooth line with color mapped to cut, perhaps overlaid on the background data.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_smooth(mapping = aes(color = cut), se = FALSE) + 
  labs(title = "Carat size vs. Price",
       subtitle = "Fair cut diamonds tend to be large, but they fetch the lowest prices for most carat sizes.",
       caption = "Data by Hadley Wickham")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

3.8.3.7 p1

Unlike p, our new plot uses color and has a legend. Let’s save it to use later when we learn to customize colors and legends.

p1 <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_smooth(mapping = aes(color = cut), se = FALSE) + 
  labs(title = "Carat size vs. Price",
       subtitle = "Fair cut diamonds tend to be large, but they fetch the lowest prices for most carat sizes.",
       caption = "Data by Hadley Wickham")

3.8.3.8 annotate()

annotate() provides a final way to label your graph: it adds a single geom to your plot. When you use annotate(), you must first choose which type of geom to add. Next, you must manually supply a value for each aesthetic required by the geom.

So for example, we could use annotate() to add text to our plot.

p1 + annotate("text", x = 4, y = 7500, label = "There are no cheap,\nlarge diamonds")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Notice that I select geom_text() with “text”, the suffix of the function name in quotation marks.

In practice, I find annotate() time consuming to work with, but you can accomplish quite a lot with annotate() if you take the time.
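
annotate() is not limited to text. As a sketch, the code below shades a region with "rect" (which selects geom_rect()) and then labels it. The plot is rebuilt under the hypothetical name p1_sketch so that the example runs on its own, and the coordinates of the rectangle are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Rebuild a plot like p1 so that this example is self-contained
p1_sketch <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_smooth(mapping = aes(color = cut), se = FALSE)

# "rect" requires xmin, xmax, ymin, and ymax; "text" requires x, y, and label
p1_annotated <- p1_sketch + 
  annotate("rect", xmin = 3, xmax = 5, ymin = 0, ymax = 7500, alpha = 0.2) +
  annotate("text", x = 4, y = 3750, label = "No cheap,\nlarge diamonds")

p1_annotated
```

Each call to annotate() adds one layer, so you can stack as many annotations as the plot can bear.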

3.8.4 Themes

One of the most effective ways to control the look of your plot is with a theme.

3.8.4.1 What is a theme?

A theme describes how the non-data elements of your plot should look. For example, these two plots show the same data, but they use two very different themes.

image image

3.8.4.2 Theme functions

To change the theme of your plot, add a theme_ function to your plot call. The ggplot2 package provides eight theme functions to choose from.

  • theme_bw()
  • theme_classic()
  • theme_dark()
  • theme_gray()
  • theme_light()
  • theme_linedraw()
  • theme_minimal()
  • theme_void()

Use the box below to plot p1 with each of the themes. Which theme do you prefer? Which theme does ggplot2 apply by default?

p1 + theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_classic()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_dark()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_gray()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_light()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_linedraw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_void()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

"Good Job! ggplot2 uses theme_gray() by default."

3.8.4.3 ggthemes

If you would like to give your graph a more complete makeover, the ggthemes package provides extra themes that imitate the graph styles of popular software packages and publications. These include:

  • theme_base()
  • theme_calc()
  • theme_economist()
  • theme_economist_white()
  • theme_excel()
  • theme_few()
  • theme_fivethirtyeight()
  • theme_foundation()
  • theme_gdocs()
  • theme_hc()
  • theme_igray()
  • theme_map()
  • theme_pander()
  • theme_par()
  • theme_solarized()
  • theme_solarized_2()
  • theme_solid()
  • theme_stata()
  • theme_tufte()
  • theme_wsj()

Try plotting p1 with at least two or three of the themes mentioned above.

library(ggthemes)
p1 + theme_base()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_calc()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_economist()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_economist_white()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_excel()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_few()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_fivethirtyeight()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_foundation()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_gdocs()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_hc()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_igray()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_map()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_pander()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_par()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_solarized()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_solarized_2()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_solid()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_stata()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_tufte()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

p1 + theme_wsj()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

"Good Job! Notice that each theme supplies its own font sizes, which means that your captions might run off the page for some themes. In practice, you can fix this by resizing your graph window."

3.8.4.4 Update p1

If you compare the ggthemes themes to the styles they imitate, you might notice something: the colors used to plot your data haven’t changed. The colors are noticeably ggplot2 colors. In the next section, we’ll look at how to customize this remaining part of your graph: the data elements.

Before we go on, I suggest that we update p1 to use theme_bw(). It will make our next set of modifications easier to see.

p1 <- p1 + theme_bw()
p1
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

3.8.5 Scales

3.8.5.1 What is a scale?

Every time you map an aesthetic to a variable, ggplot2 relies on a scale to select the specific colors, sizes, or shapes to use for the values of your variable.

A scale is an R function that works like a mathematical function; it maps each value in a data space to a level in an aesthetic space. But it may be easier to think of a scale as a “palette.” When you give your graph a color scale, you give it a palette of colors to use.
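
To make the mapping idea concrete, here is a sketch that spells out a palette by hand with scale_color_manual(). The named vector plays the role of the function: each level of cut on the left maps to one color on the right. The color choices and the names cut_colors and p_manual are hypothetical.

```r
library(ggplot2)

# A named vector maps each value in the data space (a level of cut)
# to a level in the aesthetic space (a color); the colors are arbitrary
cut_colors <- c("Fair"      = "firebrick",
                "Good"      = "darkorange",
                "Very Good" = "gold3",
                "Premium"   = "seagreen",
                "Ideal"     = "steelblue")

p_manual <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_smooth(mapping = aes(color = cut), se = FALSE) +
  theme_bw() +
  scale_color_manual(values = cut_colors)

p_manual
```

Most of the time you will not write the mapping yourself; the scale functions in the rest of this section choose the colors for you.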

3.8.5.2 Using scales

ggplot2 chooses a pleasing set of scales to use whenever you make a graph. You can change or customize these scales by adding a scale function to your plot call.

For example, the code below plots p1 in greyscale instead of the default colors.

p1 + scale_color_grey()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

A second example

You can add scales for every aesthetic mapping, including the x and y mappings (the code below log transforms the x and y axes).

p1 +
  scale_x_log10() + 
  scale_y_log10()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot2 supplies over 50 scales to use. This may seem overwhelming, but the scales are organized according to an intuitive naming convention.

3.8.5.3 Naming convention

ggplot2 scale functions follow a naming convention. Each function name contains the same three elements in order, separated by underscores:

  • The prefix scale
  • the name of an aesthetic, which the scale adjusts (e.g. color, fill, size)
  • a unique label for the scale (e.g. grey, brewer, manual)

scale_shape_manual() and scale_x_continuous() are examples of the naming scheme.

You can see the complete list of scale names at http://ggplot2.tidyverse.org/reference/. In this tutorial, we will focus on scales that work with the color aesthetic.
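
If you prefer to explore the naming convention from the console, the sketch below lists the scale functions that your installed version of ggplot2 exports and tallies them by aesthetic. The exact counts depend on your ggplot2 version, and a few helpers that are not scales may also match the pattern.

```r
library(ggplot2)

# Collect every exported ggplot2 function whose name starts with "scale_"
scale_functions <- grep("^scale_", getNamespaceExports("ggplot2"), value = TRUE)
length(scale_functions)

# Split each name on underscores; the second piece is the aesthetic
aesthetics <- vapply(strsplit(scale_functions, "_"),
                     function(parts) parts[2],
                     character(1))
sort(table(aesthetics), decreasing = TRUE)

# List only the scales that adjust the color aesthetic
sort(grep("^scale_color_", scale_functions, value = TRUE))
```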

3.8.5.4 Discrete vs. continuous

Scales specialize in either discrete variables or continuous variables. In other words, you would use a different set of scales to map a discrete variable, like diamond clarity, than you would use to map a continuous variable, like diamond price.

Which type of variable does p1 map to the color aesthetic?

  • Discrete ✓
  • Continuous ✗
Correct!

p1 maps color to cut, a discrete variable with five distinct levels.

3.8.5.5 scale_color_brewer

One of the most useful color scales for discrete variables is scale_color_brewer() (scale_fill_brewer() if you are working with fill). Run the code below to see the effect of the scale.

p1 + scale_color_brewer()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

"Good job! scale_color_brewer() applies a color palette from the RColorBrewer package, a package that specializes in attractive color palettes."

3.8.5.6 RColorBrewer

The RColorBrewer package contains a variety of palettes developed by Cynthia Brewer. Each palette is designed to look pleasing as well as to differentiate between the values represented by the palette. You can learn more about the color brewer project at colorbrewer2.org.

Altogether, the RColorBrewer package contains 35 palettes. You can see each palette and its name by running RColorBrewer::display.brewer.all(). Try it below.

library(RColorBrewer)
RColorBrewer::display.brewer.all()
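
If you would rather inspect the palettes as data than as a picture, RColorBrewer also exposes a lookup table, brewer.pal.info, and a function, brewer.pal(), that returns the colors of a palette as hex codes. A short sketch:

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per palette
nrow(brewer.pal.info)

# Palettes are grouped into sequential (seq), diverging (div),
# and qualitative (qual) categories
table(brewer.pal.info$category)

# Extract five colors from the Blues palette as hex codes
blues_5 <- brewer.pal(n = 5, name = "Blues")
blues_5
```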

"Good job! Our graph above used the Blues palette (the default)."

3.8.5.7 Brewer palettes

By default, scale_color_brewer() will use the “Blues” palette from the RColorBrewer package. To use a different RColorBrewer palette, set the palette argument of scale_color_brewer() to one of the RColorBrewer palette names, surrounded by quotation marks, e.g.

p1 + scale_color_brewer(palette = "Purples")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

image

Exercise - scale_color_brewer()

Recreate the graph below, which uses a different palette from the RColorBrewer package.

image

p1 + scale_color_brewer(palette = "Spectral")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

image

"Good job! scale_color_brewer() is one of the most useful functions for customizing colors in ggplot2 because it does for you the hard work of selecting a pleasing combination of colors. If you'd like to select individual colors yourself, try the scale_color_manual() function."

3.8.5.8 Continuous colors

scale_color_brewer() works with discrete variables, but what if your plot maps color to a continuous variable?

Since we do not have a plot that applies color to a continuous variable, let’s make one.

p_cont <- ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = hwy)) +
  theme_bw()

p_cont

3.8.5.9 Discrete vs. continuous in action

If we apply scale_color_brewer() to our new plot, we get an error message that confirms what you know: you cannot use a scale that is built for discrete variables to customize the mapping to a continuous variable.

p_cont + scale_color_brewer()
## Error: Continuous value supplied to discrete scale

3.8.5.10 distiller

Luckily, scale_color_brewer() comes with a continuous analogue named scale_color_distiller() (also scale_fill_distiller()).

Use scale_color_distiller() just as you would scale_color_brewer(). scale_color_distiller() will take any RColorBrewer palette, and interpolate between colors as necessary to provide an entire continuous range of colors.

So for example, we could reuse the Spectral palette in our continuous plot

p_cont + scale_color_distiller(palette = "Spectral")
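
To get a feel for what interpolating between colors means, the sketch below stretches seven Spectral colors into twenty with base R's colorRampPalette(). This only illustrates the idea; I am not claiming it is how scale_color_distiller() is implemented internally, and the counts of seven and twenty are arbitrary.

```r
library(RColorBrewer)

# Start from seven colors of the Spectral palette
spectral_7 <- brewer.pal(n = 7, name = "Spectral")

# colorRampPalette() returns a function that interpolates between those colors
spectral_ramp <- colorRampPalette(spectral_7)

# Ask for 20 evenly spaced colors along the interpolated range
spectral_20 <- spectral_ramp(20)
spectral_20
```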

Exercise - scale_color_distiller()

Recreate the graph below, which uses a different palette from the RColorBrewer package.

image

p_cont + scale_color_distiller(palette = "BrBG")

"Good job! ggplot2 also supplies scale_color_gradient(), scale_color_gradient2(), and scale_color_gradientn(), which you can use to construct gradients manually between 2, 3, and n colors."

3.8.5.11 viridis

The viridis package contains a collection of very good looking color palettes for continuous variables. Each palette is designed to show the gradation of continuous values in an attractive and perceptually uniform way (no range of values appears more important than another). As a bonus, the palettes are friendly to both color-blind readers and black-and-white printers!

To add a viridis palette, use scale_color_viridis() or scale_fill_viridis(), both of which come in the viridis package.

library(viridis)
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## The following object is masked from 'package:maps':
## 
##     unemp
p_cont + scale_color_viridis()

3.8.5.12 viridis options

The viridis package comes with several color palettes, including magma, inferno, plasma, and viridis.

Each palette has a single-letter code. To select a viridis color palette, set the option argument of scale_color_viridis() to one of “A” (magma), “B” (inferno), “C” (plasma), or “D” (viridis).

Try each option with p_cont below. Determine which is the default.

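# Name the option argument; an unnamed string would be used as the legend title
p_cont + scale_color_viridis(option = "A")

p_cont + scale_color_viridis(option = "B")

p_cont + scale_color_viridis(option = "C")

p_cont + scale_color_viridis(option = "D")  # D is default. See ?scale_color_viridis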

"Good job! Option D is the default if you do not select an option."

3.8.6 Legends

3.8.6.1 Customizing a legend

The last piece of a ggplot2 graph to customize is the legend. When it comes to legends, you can customize the:

  • position of the legend within the graph
  • the “type” of the legend, or whether a legend appears at all
  • the title and labels in the legend

Customizing legends is a little more chaotic than customizing other parts of the graph, because the information that appears in a legend comes from several different places.

3.8.6.2 Positions

To change the position of a legend in a ggplot2 graph add one of the below to your plot call:

  • + theme(legend.position = "bottom")
  • + theme(legend.position = "top")
  • + theme(legend.position = "left")
  • + theme(legend.position = "right") (the default)

Try this now. Move the legend in p_cont to the bottom of the graph.

p_cont + theme(legend.position = "bottom")

"Good job! If you move the legend to the top or bottom of the plot, ggplot2 will reorganize the orientation of the legend from vertical to horizontal."

3.8.6.3 theme() vs. themes

Theme functions like theme_grey() and theme_bw() also adjust the legend position (among all of the other details they orchestrate). So if you use theme(legend.position = "bottom") in your plots, be sure to add it after any theme_ functions you call, like this:

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = hwy)) +
  theme_bw() +
  theme(legend.position = "bottom")

If you do this, ggplot2 will apply all of the settings of theme_bw(), and then overwrite the legend position setting to “bottom” (instead of vice versa).
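
You can check the effect of the ordering without drawing anything. The sketch below builds the same plot twice, once in each order, and then reads the legend position stored in each plot's theme. The names base_plot, legend_kept, and legend_reset are hypothetical, and reading the setting with $theme is just a convenient way to peek at it.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = hwy))

# theme() added after theme_bw(): the custom legend position survives
legend_kept <- base_plot + theme_bw() + theme(legend.position = "bottom")

# theme() added before theme_bw(): the complete theme overwrites it
legend_reset <- base_plot + theme(legend.position = "bottom") + theme_bw()

legend_kept$theme$legend.position
legend_reset$theme$legend.position
```

The first value should read "bottom" and the second "right", because theme_bw() resets every theme setting, including the legend position.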

3.8.6.4 Types

You may have noticed that color and fill legends take two forms. If you map color (or fill) to a discrete variable, the legend will look like a standard legend. This is the case for the bottom legend below.

If you map color (or fill) to a continuous variable, your legend will look like a colorbar. This is the case in the top legend below. The colorbar helps convey the continuous nature of the variable.

3.8.6.5 Changing type

You can use the guides() function to change the type or presence of each legend in the plot. To use guides(), type the name of the aesthetic whose legend you want to alter as an argument name. Then set it to one of:

  • “legend” - to force a legend to appear as a standard legend instead of a colorbar
  • “colorbar” - to force a legend to appear as a colorbar instead of a standard legend. Note: this can only be used when the legend can be printed as a colorbar (in which case the default will be colorbar).
  • “none” - to remove the legend entirely. This is useful when you have redundant aesthetic mappings, but it may make your plot indecipherable otherwise.

The examples below use p_legend, a plot that has both a discrete color legend and a continuous fill colorbar. It is defined in the tutorial source: https://github.com/rstudio-education/primers/blob/master/visualize-data/08-Customize/08-Customize.Rmd

p_legend <- ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = class, fill = hwy), 
              shape = 21, size = 3, stroke = 1) +
  theme_bw()

p_legend + guides(fill = "legend", color = "none")

image

p_legend + guides(fill = "none", color = "none")

image

p_cont

p_cont + guides(fill = "legend", color = "none")

p_cont + guides(fill = "none", color = "none")

3.8.6.6 Labels

To control the title and labels of a legend, you must turn to the scale_ functions. Each scale_ function takes a name and a labels argument, which it will use to build the legend associated with the scale. The labels argument should be a vector of strings that has one string for each label in the default legend.

So for example, you can adjust the legend of p1 with

p1 + scale_color_brewer(name = "Cut Grade", labels = c("Very Bad", "Bad", "Mediocre", "Nice", "Very Nice"))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

3.8.6.7 What if?

This is handy, but it raises a question: what if you haven’t invoked a scale_ function to pass labels to? For example, the graph below relies on the default scales.

p1
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

3.8.6.8 Default scales

In this case, you need to identify the default scale used by the plot and then manually add that scale to the plot, setting the labels as you do.

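For example, the default color scale for a discrete variable is scale_color_discrete(). (Because cut is an ordered factor, recent versions of ggplot2 actually default to scale_color_ordinal() for p1, so adding scale_color_discrete() may also change the line colors.) If you know the default, you can relabel the legend like so: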

p1 + scale_color_discrete(name = "Cut Grade", labels = c("Very Bad", "Bad", "Mediocre", "Nice", "Very Nice"))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

3.8.6.9 Scale defaults

As you can see, it is handy to know which scales a ggplot2 graph will use by default. Here’s a short list.

aesthetic   variable     default
x           continuous   scale_x_continuous()
x           discrete     scale_x_discrete()
y           continuous   scale_y_continuous()
y           discrete     scale_y_discrete()
color       continuous   scale_color_continuous()
color       discrete     scale_color_discrete()
fill        continuous   scale_fill_continuous()
fill        discrete     scale_fill_discrete()
size        continuous   scale_size()
shape       discrete     scale_shape()

3.8.7 Quiz

In this tutorial, you learned how to customize the graphs that you make with ggplot2 in several ways. You learned how to:

  • Zoom in on regions of the graph
  • Add titles, subtitles, and annotations
  • Add themes
  • Add color scales
  • Adjust legends

To cement your skills, combine what you’ve learned to recreate the plot below.

image

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point() + 
  geom_smooth(mapping = aes(color = cut), se = FALSE) +  
  labs(title = "Ideal cut diamonds command the best price for every carat size",
       subtitle = "Lines show GAM estimate of mean values for each level of cut",
       caption = "Data provided by Hadley Wickham",
       x = "Log Carat Size",
       y = "Log Price Size",
       color = "Cut Rating") +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(palette = "Greens") +
  theme_light()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

4 Tidy Your Data

Unlock the tidyverse by learning how to make and use tidy data, the data format designed for R.

4.1 Reshape Data

Data comes in many formats, but R prefers just one: Tidy Data. Learn to recognise and make tidy data in this tutorial, as well as how to reshape the layout of any data set.

4.1.1 Welcome

The tools that you learned in the previous Primers work best when your data is organized in a specific way. This format is known as tidy data and it appears throughout the tidyverse. You will spend a lot of time as a data scientist wrangling your data into a useable format, so it is important to learn how to do this fast.

This tutorial will teach you how to recognize tidy data, as well as how to reshape untidy data into a tidy format. In it, you will learn the core data wrangling functions for the tidyverse:

  • gather() - which reshapes wide data into long data (pivot_longer() is its newer equivalent in tidyr)
  • spread() - which reshapes long data into wide data (pivot_wider() is its newer equivalent in tidyr)

This tutorial uses the core tidyverse packages, including ggplot2, dplyr, and tidyr, as well as the babynames package. All of these packages have been pre-installed and pre-loaded for your convenience.
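
As a brief preview, here is a sketch of both pairs of functions applied to table4a, a small table that ships with tidyr. The pivot_ functions assume tidyr 1.0.0 or later, and the names long_gather and long_pivot are my own.

```r
library(tidyr)

# table4a stores case counts in two columns named 1999 and 2000;
# both calls stack those columns into a year column and a cases column
long_gather <- gather(table4a, key = "year", value = "cases", `1999`, `2000`)
long_pivot  <- pivot_longer(table4a, cols = c(`1999`, `2000`),
                            names_to = "year", values_to = "cases")

long_gather
long_pivot

# spread() and pivot_wider() reverse the operation
spread(long_gather, key = year, value = cases)
pivot_wider(long_pivot, names_from = year, values_from = cases)
```

The two long tables hold the same values; only the order of the rows differs.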

Click the Next Topic button to begin.

4.1.2 Tidy Data

4.1.2.1 Variables, values, and observations

In Exploratory Data Analysis, we proposed three definitions that are useful for data science:

  • A variable is a quantity, quality, or property that you can measure.

  • A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

  • An observation is a set of measurements that are made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a case or data point.

These definitions are tied to the concept of tidy data. To see how, let’s apply the definitions to some real data.

4.1.2.2 Quiz 1 - What are the variables?

table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

What are the variables in the data set above? Check all that apply.

  • country ✓
  • year ✓
  • cases ✓
  • population ✓
  • count ✗
  • type ✗
Good Job! The data set contains four variables measured on six observations: country, year, cases, and population.

4.1.2.3 Quiz 2 - What are the variables?

Now consider this data set. Does it contain the same variables?

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Does the data above contain the variables country, year, cases, and population?

  • Yes ✓
  • No ✗
Correct!

If you look closely, you will see that this is the same data set as before, but organized in a new way.

4.1.2.4 The shapes of data

These data sets reveal something important: you can reorganize the same set of variables, values, and observations in many different ways.

It’s not hard to do. If you run the code chunks below, you can see the same data displayed in three more ways.

table3
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table4a; table4b
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
table5
## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

4.1.3 Tidy data

Data can come in a variety of formats, but one format is easier to use in R than the others. This format is known as tidy data. A data set is tidy if:

  1. Each variable is in its own column
  2. Each observation is in its own row
  3. Each value is in its own cell (this follows from #1 and #2)

Among our tables above, only table1 is tidy.

table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
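
To connect the three rules to the tables above, here is a sketch that reshapes the untidy table2 until each variable has its own column. It uses pivot_wider(), which assumes tidyr 1.0.0 or later, and table2_tidy is a hypothetical name.

```r
library(tidyr)

# table2 stores cases and population in alternating rows of one column;
# moving each type into its own column leaves one row per country and year
table2_tidy <- pivot_wider(table2, names_from = type, values_from = count)
table2_tidy

# Compare the reshaped table with table1
all.equal(as.data.frame(table2_tidy), as.data.frame(table1))
```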

4.1.3.1 Extracting variables

To see why tidy data is easier to use, consider a basic task. Each code chunk below extracts the values of the cases variable as a vector and computes the mean of the variable. One uses a tidy table, table1:

mean(table1$cases)
## [1] 91276.67

The other uses an untidy table, table2:

mean(table2$count[c(1, 3, 5, 7, 9, 11)])
## [1] 91276.67

Which line of code is easier to write? Which line could you write if you’ve only looked at the first row of the data?

4.1.3.2 Reusing code

Not only is the code for table1 easier to write, it is easier to reuse. To see what I mean, modify the code chunks below to compute the mean of the population variable for each table.

First with table1:

mean(table1$cases)
## [1] 91276.67

Then with table2:

mean(table2$count[c(1, 3, 5, 7, 9, 11)])
## [1] 91276.67

Again table1 is easier to work with; you only need to change the name of the variable that you wish to extract. Code like this is easier to generalize to new data sets (if they are tidy) and easier to automate with a function.
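The point about automating with a function can be sketched in a few lines of base R. This is only an illustration: table1_copy is a hand-typed copy of the printout above, and column_mean() is a hypothetical helper, not a function from any package.

```r
# A local copy of table1, typed in from the printout above.
table1_copy <- data.frame(
  country    = rep(c("Afghanistan", "Brazil", "China"), each = 2),
  year       = rep(c(1999L, 2000L), times = 3),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898,
                 1272915272, 1280428583)
)

# Hypothetical helper: the mean of any column of a tidy table.
column_mean <- function(data, variable) {
  mean(data[[variable]])
}

column_mean(table1_copy, "cases")       # same value as mean(table1$cases)
column_mean(table1_copy, "population")  # only the column name changes
```

Because every variable lives in its own column, the helper needs nothing more than the column name. A comparable helper for table2 would also have to know which rows belong to which variable.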

Let’s look at one more advantage.

4.1.3.3 Calculations

Suppose you would like to compute the ratios of cases to population for each country and each year. To do this, you need to ensure that the correct value of cases is paired with the correct value of population when you do the calculation.

Again, this is hard to do with untidy table2:

table2$count[c(1, 3, 5, 7, 9, 11)] / table2$count[c(2, 4, 6, 8, 10, 12)]
## [1] 0.0000372741 0.0001294466 0.0002193930 0.0004612363 0.0001667495
## [6] 0.0001669488

But it is easy to do with tidy table1. Give it a try below:

table1$cases / table1$population
## [1] 0.0000372741 0.0001294466 0.0002193930 0.0004612363 0.0001667495
## [6] 0.0001669488

These small differences may seem petty, but they add up over the course of a data analysis, stealing time and inviting mistakes.
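To see how such mistakes creep in, here is a small base R sketch. It assumes a hand-typed, two-year excerpt of the tables above (the names narrow and tidy are made up for the example) and shows what happens to each layout when the rows are reordered.

```r
# An excerpt of table2's layout for one country (values from the printout above).
narrow <- data.frame(
  year  = c(1999, 1999, 2000, 2000),
  type  = c("cases", "population", "cases", "population"),
  count = c(745, 19987071, 2666, 20595360)
)

# Positional indexing works only while rows alternate cases, population, ...
rate_by_position <- narrow$count[c(1, 3)] / narrow$count[c(2, 4)]

# Sorting by type keeps every value but breaks the alternating pattern,
# so the same positions now pair the wrong numbers.
resorted   <- narrow[order(narrow$type), ]
wrong_rate <- resorted$count[c(1, 3)] / resorted$count[c(2, 4)]

# In a tidy layout each row carries its own pair, so row order is irrelevant.
tidy <- data.frame(
  year       = c(1999, 2000),
  cases      = c(745, 2666),
  population = c(19987071, 20595360)
)
shuffled  <- tidy[c(2, 1), ]
rate_tidy <- shuffled$cases / shuffled$population
```

After the reshuffle, rate_tidy still holds the correct rate for each row, while wrong_rate silently divides cases by cases and population by population.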

4.1.3.4 Tidy data and R

The tidy data format works so well for R because it aligns the structure of your data with the mechanics of R:

  • R stores each data frame as a list of column vectors, which makes it easy to extract a column from a data frame as a vector. Tidy data places each variable in its own column vector, which makes it easy to extract all of the values of a variable to compute a summary statistic, or to use the variable in a computation.

  • R computes many functions and operations in a vectorized fashion, matching the first values of each vector of input to compute the first result, matching the second values of each input to compute the second result, and so on. Tidy data ensures that R will always match values with other values from the same observation whenever vector inputs are drawn from the same table.
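Both points can be checked directly in base R. The sketch below uses a small hand-built data frame (the name tidy and the three rows are chosen for the example; the values come from table1).

```r
# A small tidy table built by hand (values from table1 above).
tidy <- data.frame(
  cases      = c(745, 2666, 37737),
  population = c(19987071, 20595360, 172006362)
)

# A data frame is a list whose elements are the column vectors.
is.list(tidy)            # TRUE
is.numeric(tidy$cases)   # TRUE: extracting a column yields a plain vector

# Division is vectorized: element i of cases is divided by element i of population.
rate <- tidy$cases / tidy$population
rate[2] == tidy$cases[2] / tidy$population[2]   # TRUE
```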

As a result, most functions in R, and every function in the tidyverse, will expect your data to be organized into a tidy format. (You may have noticed above that we could use dplyr functions to work on table1, but not on table2.)

4.1.3.5 Recap

“Data comes in many formats, but R prefers just one: tidy data.” — Garrett Grolemund

A data set is tidy if:

  1. Each variable is in its own column
  2. Each observation is in its own row
  3. Each value is in its own cell (this follows from #1 and #2)

Now that you know what tidy data is, what can you do about untidy data?

4.1.4 Gathering columns

4.1.4.1 Untidy data

“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham

How you tidy an untidy data set will depend on the initial configuration of the data. For example, consider the cases data set below.

cases
## # A tibble: 3 × 4
##   Country `2011` `2012` `2013`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 FR        7000   6900   7000
## 2 DE        5800   6000   6200
## 3 US       15000  14000  13000

4.1.4.2 Quiz 3 - What are the variables?

What are the variables in cases?

  • Country, 2011, 2012, and 2013 ✗
  • Country, year, and some unknown quantity (n, count, number of cases, etc.) ✓
  • FR, DE, and US ✗
Correct!

4.1.4.3 A tidy version of cases

Video: https://vimeo.com/229581273

4.1.4.4 gather()

You can use the gather() function in the tidyr package to convert wide data to long data. Notice that gather() returns a tidy copy of the dataset, but does not alter the original dataset. If you wish to use this copy later, you’ll need to save it somewhere.

cases %>% gather(key = "year", value = "n", 2, 3, 4)
## # A tibble: 9 × 3
##   Country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 DE      2011   5800
## 3 US      2011  15000
## 4 FR      2012   6900
## 5 DE      2012   6000
## 6 US      2012  14000
## 7 FR      2013   7000
## 8 DE      2013   6200
## 9 US      2013  13000
# pivot_longer
cases %>% pivot_longer(cols = 2:4, names_to = "year", values_to = "n")
## # A tibble: 9 × 3
##   Country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000

Let’s take a closer look at the gather() syntax.

4.1.4.5 gather() syntax

Here’s the same call written without the pipe operator, which makes the syntax easier to see.

gather(cases, key = "year", value = "n", 2, 3, 4)

To use gather(), pass it the name of a data set to reshape followed by two new column names to use. Each name should be a character string surrounded by quotes:

  • the key string will become the name of a new column that contains former column names.
  • the value string will become the name of a new column that contains former cell values.

Finally, use numbers to tell gather() which columns to use to build the new columns. Here gather() will use the second, third, and fourth columns. gather() will remove these columns from the results, but their contents will appear in the new columns. Any unspecified columns will remain in the dataset, their contents repeated as often as necessary to duplicate each relationship in the original untidy data set.
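Here is a minimal sketch of saving the reshaped copy. It assumes tidyr is installed, rebuilds cases by hand from the printout above so that it runs on its own, and uses tidy_cases as a made-up name for the saved result.

```r
library(tidyr)

# The cases table from above, rebuilt by hand so the example is self-contained.
cases <- tibble::tribble(
  ~Country, ~`2011`, ~`2012`, ~`2013`,
  "FR",        7000,    6900,    7000,
  "DE",        5800,    6000,    6200,
  "US",       15000,   14000,   13000
)

# gather() returns a new table; assign it to a name to keep it.
tidy_cases <- gather(cases, key = "year", value = "n", 2, 3, 4)

dim(cases)       # still 3 rows and 4 columns: the original is untouched
dim(tidy_cases)  # 9 rows and 3 columns: one row per country-year pair
```

The Country column was not listed in the call, so it stays in the result and each country code is repeated once per year.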

4.1.4.6 Key and Value columns

gather() relies on the idea of key:value pairs. A key value pair is a pair that lists a value alongside the name of the variable that the value describes. (We could store every value in a dataset as a key value pair, but this is not how R works.)

In a tidy data set, you will find “keys”—that is variable names—in the column names of the data set. The values will appear in the cells of the columns. Here we know that the key for each value in the year column is year. This arrangement reduces duplication.

Sometimes you will also find key value pairs listed beside each other in two separate columns, as in table2. Here the type column lists the keys that are associated with the count column. This layout is sometimes called “narrow” data.

Tidyr functions rely on the key value vocabulary to describe what should go where. In gather(), the key argument names the new column that will hold the values that previously appeared in the key position, i.e. in the column names. The value argument names the new column that will hold the values that previously appeared in the value position, i.e. in the cells.

4.1.4.7 Exercise 1 - Tidy table4a

Now that you’ve seen gather() in action, try using it to tidy table4a:

table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
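# Starter code, carried over from the cases example:
# cases %>% gather(key = "year", value = "n", 2, 3, 4)
# Solution for table4a, which has only two columns to gather:
table4a %>% gather(key = "year", value = "n", 2, 3)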
## # A tibble: 6 × 3
##   country     year       n
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
# pivot_longer
table4a %>% pivot_longer(cols = 2:3, names_to = "year", values_to = "n")
## # A tibble: 6 × 3
##   country     year       n
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
"Good job!"

4.1.4.8 Specifying columns

So far we’ve used numbers to describe which columns to reshape with gather(), but this isn’t necessary. gather() also recognizes column names as well as all of the select() helpers that you learned about in Isolating Data with dplyr. So for example, these expressions would all do the same thing:

table4a %>% gather(key = "year", value = "cases", 2, 3)
table4a %>% gather(key = "year", value = "cases", `1999`, `2000`)
table4a %>% gather(key = "year", value = "cases", -country)
table4a %>% gather(key = "year", value = "cases", one_of(c("1999", "2000")))

Notice that 1999 and 2000 are numbers. When you directly call column names that are numbers, you need to surround the names with backticks (otherwise gather() would think you mean the 1999th and 2000th columns). Use ?select_helpers to open a help page that lists the select helpers.
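If you would like to convince yourself that these forms are interchangeable, you can compare their results. The sketch below assumes tidyr is installed (it supplies table4a and the pipe); the names by_position, by_name, and by_negation are invented for the example.

```r
library(tidyr)  # also supplies table4a

by_position <- table4a %>% gather(key = "year", value = "cases", 2, 3)
by_name     <- table4a %>% gather(key = "year", value = "cases", `1999`, `2000`)
by_negation <- table4a %>% gather(key = "year", value = "cases", -country)

# Each comparison is expected to report that the results match.
all.equal(by_position, by_name)
all.equal(by_position, by_negation)

# Backticks mark 1999 as a name, so this extracts the column called "1999".
table4a$`1999`
```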

4.1.4.9 Exercise 2 - Tidy table4b

Use gather() and the - helper to tidy table4b into a dataset with three columns: country, year, and population.

table4b
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
table4b %>% gather(key = "year", value = "population", -country)
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583
# pivot_longer
table4b %>% pivot_longer(cols = -country, names_to = "year", values_to = "population")
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583

4.1.4.10 Converting output

If you looked closely at your results in the previous exercises, you may have noticed something odd: the new year column is a character vector. You can tell because R displays <chr> beneath the column name.

table4b %>% gather(key = "year", value = "population", -country, convert = TRUE)
## # A tibble: 6 × 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1999   19987071
## 2 Brazil       1999  172006362
## 3 China        1999 1272915272
## 4 Afghanistan  2000   20595360
## 5 Brazil       2000  174504898
## 6 China        2000 1280428583
# pivot_longer
table4b %>% pivot_longer(cols = -country, names_to = "year", values_to = "population", names_transform = list(year = as.integer))
## # A tibble: 6 × 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1999   19987071
## 2 Afghanistan  2000   20595360
## 3 Brazil       1999  172006362
## 4 Brazil       2000  174504898
## 5 China        1999 1272915272
## 6 China        2000 1280428583

You can ask R to convert the new key column to an appropriate data type by adding convert = TRUE to the gather() call. R will inspect the contents of the column to choose the most likely data type. Give it a try in the code above!
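The tidyr documentation describes convert = TRUE as running base R's type.convert() on the key column. The sketch below calls type.convert() directly on a hand-made vector, so you can see the kind of guess R makes. It illustrates the idea and is not a look inside gather() itself.

```r
year_as_text <- c("1999", "2000")
class(year_as_text)                          # "character"

# type.convert() guesses a more specific type from the contents.
year_converted <- type.convert(year_as_text, as.is = TRUE)
class(year_converted)                        # "integer"

# Text that does not look like a number is left as character.
class(type.convert(c("FR", "DE"), as.is = TRUE))   # "character"
```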

4.1.4.11 The flexibility of gather()

cases, table4a, and table4b are all rectangular tables:

  • each row corresponds to the value of a variable, and
  • each column corresponds to the value of a variable

Rectangular tables are a simple form of wide data. But you will also encounter more complicated examples of wide data. For example, it is common for researchers to place one subject per row. In this case, you might see several columns of identifying information followed by a set of columns that list repeated measurements of the same variable. cases2 emulates such a data set.

cases2
## # A tibble: 3 × 6
##   city    country continent     `2011` `2012` `2013`
##   <chr>   <chr>   <chr>          <dbl>  <dbl>  <dbl>
## 1 Paris   FR      Europe          7000   6900   7000
## 2 Berlin  DE      Europe          5800   6000   6200
## 3 Chicago US      North America  15000  14000  13000
cases2 %>% gather(key = "year", value = "cases", 4:6)
## # A tibble: 9 × 5
##   city    country continent     year  cases
##   <chr>   <chr>   <chr>         <chr> <dbl>
## 1 Paris   FR      Europe        2011   7000
## 2 Berlin  DE      Europe        2011   5800
## 3 Chicago US      North America 2011  15000
## 4 Paris   FR      Europe        2012   6900
## 5 Berlin  DE      Europe        2012   6000
## 6 Chicago US      North America 2012  14000
## 7 Paris   FR      Europe        2013   7000
## 8 Berlin  DE      Europe        2013   6200
## 9 Chicago US      North America 2013  13000
# pivot_longer
cases2 %>% pivot_longer(cols = 4:6, names_to = "year", values_to = "cases")
## # A tibble: 9 × 5
##   city    country continent     year  cases
##   <chr>   <chr>   <chr>         <chr> <dbl>
## 1 Paris   FR      Europe        2011   7000
## 2 Paris   FR      Europe        2012   6900
## 3 Paris   FR      Europe        2013   7000
## 4 Berlin  DE      Europe        2011   5800
## 5 Berlin  DE      Europe        2012   6000
## 6 Berlin  DE      Europe        2013   6200
## 7 Chicago US      North America 2011  15000
## 8 Chicago US      North America 2012  14000
## 9 Chicago US      North America 2013  13000

4.1.5 Spreading columns

4.1.5.1 Narrow data

The pollution dataset below displays the amount of small and large particulate in the air of three cities. It illustrates another common type of untidy data. Narrow data uses a literal key column and a literal value column to store multiple variables. Can you tell here which is which?

pollution
## # A tibble: 6 × 3
##   city     size  amount
##   <chr>    <chr>  <dbl>
## 1 New York large     23
## 2 New York small     14
## 3 London   large     22
## 4 London   small     16
## 5 Beijing  large    121
## 6 Beijing  small    121

4.1.5.2 Quiz 4 - Which is the key column?

pollution
## # A tibble: 6 × 3
##   city     size  amount
##   <chr>    <chr>  <dbl>
## 1 New York large     23
## 2 New York small     14
## 3 London   large     22
## 4 London   small     16
## 5 Beijing  large    121
## 6 Beijing  small    121

Which column in pollution contains key names (i.e. variable names)?

  • city ✗
  • size ✓
  • amount ✗
Correct!

Two properties are being measured in this data: 1) the amount of small particulate in the air, and 2) the amount of large particulate.

4.1.5.3 Quiz 5 - Which is the value column?

pollution
## # A tibble: 6 × 3
##   city     size  amount
##   <chr>    <chr>  <dbl>
## 1 New York large     23
## 2 New York small     14
## 3 London   large     22
## 4 London   small     16
## 5 Beijing  large    121
## 6 Beijing  small    121

Which column in pollution contains the values associated with the key names?

  • city ✗
  • size ✗
  • amount ✓
Correct!

What do these numbers represent? You can only tell when you match them with the variable names large (for large particulate) and small (for small particulate).

4.1.5.4 A tidy version of pollution

Video: https://vimeo.com/229581273

4.1.5.5 spread()

You can “spread” the keys in a key column across their own set of columns with the spread() function in the tidyr package. To use spread(), pass it the name of a data set to spread (provided here by the pipe %>%). Then tell spread() which column to use as a key column and which column to use as a value column.

pollution %>% spread(key = size, value = amount)
## # A tibble: 3 × 3
##   city     large small
##   <chr>    <dbl> <dbl>
## 1 Beijing    121   121
## 2 London      22    16
## 3 New York    23    14
# pivot_wider
pollution %>% pivot_wider(names_from = size, values_from = amount)
## # A tibble: 3 × 3
##   city     large small
##   <chr>    <dbl> <dbl>
## 1 New York    23    14
## 2 London      22    16
## 3 Beijing    121   121

spread() will give each unique value in the key column its own column. The name of the value will become the column name. spread() will then redistribute the values in the value column across the new columns in a way that preserves every relationship in the original dataset.

4.1.5.6 Exercise 3 - Tidy table2

Use spread() to tidy table2 into a dataset with four columns: country, year, cases, and population. In short, convert table2 to look like table1.

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table2 %>% spread(key = type, value = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
# pivot_wider
table2 %>% pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

4.1.5.7 To quote or not to quote

You may notice that both gather() and spread() take key and value arguments. And, in each case the arguments are set to column names. But in the gather() case you must surround the names with quotes, and in the spread() case you do not. Why is this?

table4b %>% gather(key = "year", value = "population", -country)
pollution %>% spread(key = size, value = amount)

Don’t let the difference trip you up. Instead think about what the quotes mean.

  • In R, any sequence of characters surrounded by quotes is a character string, which is a piece of data in and of itself.
  • Likewise, any sequence of characters not surrounded by quotes is the name of an object, which is a symbol that contains or points to a piece of data. Whenever R evaluates an object name, it searches for the object to find the data that it contains. If the object does not exist somewhere, R will return an error.

In our gather() code above, “year” and “population” refer to two columns that do not yet exist. If R tried to look for objects named year and population it wouldn’t find them (at least not in the table4b dataset). When we use gather() we are passing R two values (character strings) to use as the name of future columns that will appear in the result.

In our spread() code, key and value point to two columns that do exist in the pollution dataset: size and amount. When we use spread(), we are telling R to find these objects (columns) in the dataset and to use their contents to create the result. Since they exist, we do not need to surround them in quotation marks.

In practice, whether or not you need to use quotation marks will depend on how the author of your function wrote the function (for example, spread() will still work if you do include quotation marks). However, you can use the intuition above as a guide for how to use functions in the tidyverse.
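The difference between a quoted string and an unquoted name is easy to see in base R. In the sketch below, some_column_that_does_not_exist is a deliberately made-up name, and pollution_copy is a hand-typed excerpt of pollution.

```r
# A quoted string is data in its own right; R does not look anything up.
new_column_name <- "year"
nchar(new_column_name)   # 4

# An unquoted name makes R search for an object. The name below is
# deliberately made up, so the lookup fails with an error.
lookup <- tryCatch(
  some_column_that_does_not_exist,
  error = function(e) conditionMessage(e)
)
lookup   # the text of the error that R raised

# Inside a data frame, a column can be found by name because it exists there.
pollution_copy <- data.frame(size = c("large", "small"), amount = c(23, 14))
with(pollution_copy, amount)
```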

4.1.5.8 Boys and girls in babynames

Let’s apply spread() to a real world inquiry. The plot below visualizes an aspect of the babynames data set from the babynames package. (See Work with Data for an introduction to the babynames data set.)

[Plot: total number of children per year in babynames, with one line per sex]

The ratio of girls to boys in babynames is not constant across time. We can explore this phenomenon further by recreating the data in the plot.

4.1.5.9 Review - Make the data

To make the data displayed in the plot above, I first grouped babynames by year and sex. Then I computed a summary for each group: total, which is equal to the sum of n for each group.

Use dplyr functions to recreate this process in the chunk below.

babynames %>% 
  group_by(year, sex) %>% 
  summarize(total = sum(n))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 3
##     year sex    total
##    <dbl> <chr>  <int>
##  1  1880 F      90993
##  2  1880 M     110491
##  3  1881 F      91953
##  4  1881 M     100743
##  5  1882 F     107847
##  6  1882 M     113686
##  7  1883 F     112319
##  8  1883 M     104627
##  9  1884 F     129020
## 10  1884 M     114442
## # … with 266 more rows

4.1.5.10 Review - Make the plot

babynames %>%
  group_by(year, sex) %>% 
  summarise(total = sum(n))  %>%
  ggplot() +
    geom_line(mapping = aes(x = year, y = total, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

4.1.5.11 A better way to look at the data

A better way to explore this phenomenon would be to directly plot a ratio of boys to girls over time. To make such a plot, you would need to compute the ratio of boys to girls for each year from 1880 to 2017:

\[\mbox{ratio male} = \frac{\mbox{total male}}{\mbox{total female}}\]

But how can we plot this data? Our current iteration of babynames places the total number of boys and girls for each year in the same column, which makes it hard to use both totals in the same calculation.

babynames %>%
  group_by(year, sex) %>% 
  summarise(total = sum(n))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 3
##     year sex    total
##    <dbl> <chr>  <int>
##  1  1880 F      90993
##  2  1880 M     110491
##  3  1881 F      91953
##  4  1881 M     100743
##  5  1882 F     107847
##  6  1882 M     113686
##  7  1883 F     112319
##  8  1883 M     104627
##  9  1884 F     129020
## 10  1884 M     114442
## # … with 266 more rows

4.1.5.12 A goal

It would be easier to calculate the ratio of boys to girls if we could reshape our data to place the total number of boys born per year in one column and the total number of girls born per year in another:

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 138 × 3
##     year      F      M
##    <dbl>  <int>  <int>
##  1  1880  90993 110491
##  2  1881  91953 100743
##  3  1882 107847 113686
##  4  1883 112319 104627
##  5  1884 129020 114442
##  6  1885 133055 107799
##  7  1886 144533 110784
##  8  1887 145981 101413
##  9  1888 178622 120851
## 10  1889 178366 110580
## # … with 128 more rows

Then we could compute the ratio by piping our data into a call like mutate(ratio = M / F).
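Here is a small sketch of that idea, assuming tidyr and dplyr are installed. To keep it self-contained, it uses a hand-typed table of the first three years from the printout above (the names totals and ratios are invented for the example) rather than the full babynames data.

```r
library(tidyr)
library(dplyr)

# Yearly totals for three years, typed in from the printout above.
totals <- tibble::tribble(
  ~year, ~sex, ~total,
  1880,  "F",   90993,
  1880,  "M",  110491,
  1881,  "F",   91953,
  1881,  "M",  100743,
  1882,  "F",  107847,
  1882,  "M",  113686
)

ratios <- totals %>%
  spread(key = sex, value = total) %>%   # one column for F, one for M
  mutate(ratio = M / F)                  # boys per girl in each year

ratios
```

Once F and M sit side by side in the same row, the division pairs the two totals for each year automatically.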

4.1.5.13 Exercise 4 - Make the plot

Add to the code below to:

  1. Reshape the layout to place the total number of boys per year in one column and the total number of girls born per year in a second column.
  2. Compute the ratio of boys to girls.
  3. Plot the ratio of boys to girls over time.
babynames %>%
  group_by(year, sex) %>% 
  summarise(total = sum(n)) %>%
  spread(key = sex, value = total) %>%
  mutate(ratio = M / F) %>%
  ggplot() + 
    geom_line(mapping = aes(x = year, y = ratio))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

4.1.5.14 Interesting

Our results reveal a conspicuous oddity that is easier to interpret if we turn the ratio into a percentage.

[Plot: percent of recorded births that are male, by year]

The percent of recorded male births is unusually low between 1880 and 1936. What is happening? One insight is that the data comes from the United States Social Security office, which was only created in 1936. As a result, we can expect the data prior to 1936 to display a survivorship bias.
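The tutorial does not show how the percentage was computed. One plausible reading, offered here as an assumption, is the male share of all recorded births. The sketch uses the 1884 totals from the earlier printout, and the variable names are invented for the example.

```r
# Totals for 1884, taken from the printout earlier in this tutorial.
total_female <- 129020
total_male   <- 114442

ratio_male   <- total_male / total_female                       # boys per girl
percent_male <- 100 * total_male / (total_male + total_female)  # boys per 100 births

# The two are linked: percent = 100 * ratio / (1 + ratio).
all.equal(percent_male, 100 * ratio_male / (1 + ratio_male))
```

A ratio below 1 corresponds to a percentage below 50, which is why the early years stand out on the percentage scale.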

4.1.5.15 Recap

Your data will be easier to work with in R if you reshape it into a tidy layout at the start of your analysis. Data is tidy if:

  1. Each variable is in its own column
  2. Each observation is in its own row
  3. Each value is in its own cell

You can use gather() and spread(), or some iterative sequence of the two, to reshape your data into any possible configuration that:

  1. Retains all of the values in your original data set, and
  2. Retains all of the relationships between values in your original data set.

In particular, you can use these functions to recast your data into a tidy layout.
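One way to check that no values or relationships are lost is a round trip: gather a table, spread it back, and compare. The sketch below assumes tidyr is installed and rebuilds cases by hand; long, round_trip, and row_match are names made up for the example.

```r
library(tidyr)

cases <- tibble::tribble(
  ~Country, ~`2011`, ~`2012`, ~`2013`,
  "FR",        7000,    6900,    7000,
  "DE",        5800,    6000,    6200,
  "US",       15000,   14000,   13000
)

long       <- gather(cases, key = "year", value = "n", -Country)
round_trip <- spread(long, key = year, value = n)

# spread() may return the rows in a different order, so line them up first.
row_match <- match(cases$Country, round_trip$Country)
all(cases$`2011` == round_trip$`2011`[row_match])
all(cases$`2013` == round_trip$`2013`[row_match])
```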

4.1.5.16 Food for thought

It is not always clear whether or not a data set is tidy. For example, the version of babynames that was tidy when we wanted to plot total children by year was no longer tidy when we wanted to compute the ratio of male to female children.

The ambiguity comes from the definition of tidy data. Tidiness depends on the variables in your data set. But what is a variable depends on what you are trying to do.

To identify the variables that you need to work with, describe what you want to do with an equation. Each variable in the equation should correspond to a variable in your data.

So in our first case, we wanted to make a plot with the following mappings (i.e. equations):

\[x = \mbox{year}\] \[y=\mbox{total}\] \[\mbox{color}=\mbox{sex}\]

To do this, we needed a data set that placed year, total, and sex each in their own columns.

In our second case we wanted to compute ratio, where

\[\mbox{ratio male}=\frac{\mbox{total male}}{\mbox{total female}}\]

This formula has three variables: ratio male, total male, and total female. To create the first variable, we required a data set that isolated the second and third variables (total male and total female) in their own columns.

4.2 Separate and Unite Columns

Here you will learn to separate a column into multiple columns and to reverse the process by uniting multiple columns into a single column. Then you’ll practice your data wrangling skills on messy real world data.

4.2.1 Welcome

Data is easiest to analyze in R when it is stored in a tidy format. In the last tutorial, you learned how to tidy data that has an untidy layout, but there is another way that data sets can be untidy: a data set can combine multiple values in a single cell or spread a single value across multiple cells. This makes it difficult to extract and use values in your analysis.

This tutorial will teach you two tools that you can use to tidy this type of data:

  • separate() - which separates a column of cells into multiple columns
  • unite() - which combines multiple columns of cells into a single column

It ends with a case study that requires you to use all of the tidy tools to wrangle a messy real world data set.

This tutorial uses the core tidyverse packages, including tidyr. All of these packages have been pre-installed and pre-loaded for your convenience.

Click the Next Topic button to begin.

4.2.2 separate()

4.2.2.1 hurricanes

The hurricanes data set contains historical information about six hurricanes. At first glance it appears to contain four variables: name, wind_speed, pressure, and date. However, there are three more variables hidden in plain sight. Can you spot them?

## # A tibble: 6 × 4
##   name    wind_speed pressure date      
##   <chr>        <dbl>    <dbl> <chr>     
## 1 Alberto        110     1007 2000-08-03
## 2 Alex            45     1009 1998-07-27
## 3 Allison         65     1005 1995-06-03
## 4 Ana             40     1013 1997-06-30
## 5 Arlene          50     1010 1999-06-11
## 6 Arthur          45     1010 1996-06-17

Which variables are “hidden” in hurricanes? Check three.

  • location ✗
  • year ✓
  • month ✓
  • day ✓
Good job! The date variable also displays the year, month, and day associated with each measurement.

4.2.2.2 Dates

Did you realize that dates are a combination of multiple variables? They are.

You’ll almost always display these variables together to make a date, because a date is itself a variable—one that conveys more than the sum of its parts.

However, there are times where it is convenient to treat each element of a date separately. For example, what if you wanted to filter hurricanes to just the storms that occurred in June (i.e. month == 6)? Then it would be convenient to reorganize the data to look like this.

## # A tibble: 6 × 6
##   name    wind_speed pressure year  month day  
##   <chr>        <dbl>    <dbl> <chr> <chr> <chr>
## 1 Alberto        110     1007 2000  08    03   
## 2 Alex            45     1009 1998  07    27   
## 3 Allison         65     1005 1995  06    03   
## 4 Ana             40     1013 1997  06    30   
## 5 Arlene          50     1010 1999  06    11   
## 6 Arthur          45     1010 1996  06    17

But how could you do it?

4.2.2.3 separate()

You can separate the elements of date with the separate() function. separate() divides a column of values into multiple columns that each contain a portion of the original values.

Run the code below to see separate() in action. Then click continue to learn about the syntax.

hurricanes %>% 
  separate(col = date, into = c("year","month","day"), sep = "-")
## # A tibble: 6 × 6
##   name    wind_speed pressure year  month day  
##   <chr>        <dbl>    <dbl> <chr> <chr> <chr>
## 1 Alberto        110     1007 2000  08    03   
## 2 Alex            45     1009 1998  07    27   
## 3 Allison         65     1005 1995  06    03   
## 4 Ana             40     1013 1997  06    30   
## 5 Arlene          50     1010 1999  06    11   
## 6 Arthur          45     1010 1996  06    17

Good job! As with other tidyverse functions, separate() returns a modified copy of the original data. You will need to save the copy if you wish to use it later.

4.2.2.4 Syntax

Let’s rewrite our above command without the pipe, to make the syntax of separate() easier to see.

separate(hurricanes, col = date, into = c("year","month","day"), sep = "-")

separate() takes a data frame and then the name of a column in the data frame to separate. Here our code will separate the date column of the hurricanes data set.

The sep = “-” argument tells separate() to split each value in date wherever a - appears. You can choose to split on any character or character string.

Separating on - will split each date into three parts: a year, a month, and a day. As a result, separate() will need to add three new columns to the result. The into argument gives separate() a character vector of names to use for the new columns. Since the result will have three new columns, this vector will need to have three names. By default, separate() will issue a warning if it ends up creating fewer or more columns than column names.
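You can see how separate() reacts to a mismatch by capturing the condition it signals. The sketch below assumes tidyr is installed and uses a one-row data frame made up for the example. It relies on the default settings of the extra and fill arguments, which control this behavior.

```r
library(tidyr)

one_date <- data.frame(date = "2000-08-03")

# Three pieces but only two names: capture whatever condition is signalled.
signalled <- tryCatch(
  separate(one_date, col = date, into = c("year", "month"), sep = "-"),
  warning = function(w) "warning",
  error   = function(e) "error"
)
signalled
```

If you expect leftover pieces and want to keep them, the extra argument of separate() is the place to look in the help page.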

4.2.2.5 Defaults

By default separate() will separate values at the location of any non-alphanumeric character, like -, ,, /, etc. So for example, we could run our code without the sep = “-” argument and—in this case—get the same result.

Or will we? Do a quick mental check and then run the code to see if you are right.

hurricanes %>% 
  separate(col = date, into = c("year","month","day"))
## # A tibble: 6 × 6
##   name    wind_speed pressure year  month day  
##   <chr>        <dbl>    <dbl> <chr> <chr> <chr>
## 1 Alberto        110     1007 2000  08    03   
## 2 Alex            45     1009 1998  07    27   
## 3 Allison         65     1005 1995  06    03   
## 4 Ana             40     1013 1997  06    30   
## 5 Arlene          50     1010 1999  06    11   
## 6 Arthur          45     1010 1996  06    17

Good job! "-" is the only non-alphanumeric character used in our dates, which means that the defaults return the same output as setting sep = "-".
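The documented default for sep is a regular expression that matches any run of non-alphanumeric characters. The sketch below applies the same pattern with base R's strsplit() to a few hand-made strings. It illustrates the pattern and is not how separate() is implemented.

```r
# The documented default for sep is this regular expression:
default_sep <- "[^[:alnum:]]+"

strsplit("2000-08-03", default_sep)[[1]]   # "2000" "08" "03"
strsplit("2000/08/03", default_sep)[[1]]   # "2000" "08" "03"

# A caveat: a decimal point is non-alphanumeric too, so a value such as
# "1007.5" (hypothetical, not in hurricanes) would also be split.
strsplit("1007.5", default_sep)[[1]]       # "1007" "5"
```

When a column mixes separators that you do not want to split on, setting sep explicitly is the safer choice.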

4.2.2.6 Separating by position

If you set sep equal to an integer, separate() will split the values at the location indicated by the integers. For example,

  • sep = 1 will split the values after the first character
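  • sep = -2 will split the values before the second to last character, no matter how many characters appear in the value. In other words, it will split off the last two characters of each value. In current versions of tidyr, negative positions count from the right, with -1 at the far right.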
  • sep = c(2, 4, 6) will split the values after the second, fourth, and sixth characters, creating four sub-values
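Splitting at positive positions can be imitated by hand with base R's substr() and substring(). The sketch below uses a made-up compact date string and is meant to illustrate the idea, not to reproduce tidyr's implementation.

```r
value <- "20000803"   # hypothetical compact date, not from hurricanes

# One position: split after the second character.
c(substr(value, 1, 2), substr(value, 3, nchar(value)))   # "20" "000803"

# Several positions: cut after characters 2, 4 and 6, giving four pieces.
cuts   <- c(2, 4, 6)
starts <- c(1, cuts + 1)
ends   <- c(cuts, nchar(value))
pieces <- substring(value, starts, ends)
pieces                                                   # "20" "00" "08" "03"
```

Notice that three cut positions produce four pieces, which is why into needs one more name than sep has positions.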

Think you have it? Create this version of hurricanes by adding a second call to separate() that uses an integer separator to the code below:

## # A tibble: 6 × 7
##   name    wind_speed pressure century year  month day  
##   <chr>        <dbl>    <dbl> <chr>   <chr> <chr> <chr>
## 1 Alberto        110     1007 20      00    08    03   
## 2 Alex            45     1009 19      98    07    27   
## 3 Allison         65     1005 19      95    06    03   
## 4 Ana             40     1013 19      97    06    30   
## 5 Arlene          50     1010 19      99    06    11   
## 6 Arthur          45     1010 19      96    06    17
hurricanes %>% 
  separate(col = date, into = c("year","month","day")) %>%
  separate(col = year, into = c("century", "year"), sep = 2)
## # A tibble: 6 × 7
##   name    wind_speed pressure century year  month day  
##   <chr>        <dbl>    <dbl> <chr>   <chr> <chr> <chr>
## 1 Alberto        110     1007 20      00    08    03   
## 2 Alex            45     1009 19      98    07    27   
## 3 Allison         65     1005 19      95    06    03   
## 4 Ana             40     1013 19      97    06    30   
## 5 Arlene          50     1010 19      99    06    11   
## 6 Arthur          45     1010 19      96    06    17

4.2.2.7 Quiz - What if

Would these two commands return the same result? Why or why not? Once you have an answer, run the code below to see if you were right.

hurricanes %>% 
  separate(col = pressure, into = c("first", "last"), sep = 1)
## # A tibble: 6 × 5
##   name    wind_speed first last  date      
##   <chr>        <dbl> <chr> <chr> <chr>     
## 1 Alberto        110 1     007   2000-08-03
## 2 Alex            45 1     009   1998-07-27
## 3 Allison         65 1     005   1995-06-03
## 4 Ana             40 1     013   1997-06-30
## 5 Arlene          50 1     010   1999-06-11
## 6 Arthur          45 1     010   1996-06-17
"When sep = 1, separate() splits after the first character"
hurricanes %>% 
  separate(col = pressure, into = c("first", "last"), sep = "1")
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [4, 5, 6].
## # A tibble: 6 × 5
##   name    wind_speed first last  date      
##   <chr>        <dbl> <chr> <chr> <chr>     
## 1 Alberto        110 ""    007   2000-08-03
## 2 Alex            45 ""    009   1998-07-27
## 3 Allison         65 ""    005   1995-06-03
## 4 Ana             40 ""    0     1997-06-30
## 5 Arlene          50 ""    0     1999-06-11
## 6 Arthur          45 ""    0     1996-06-17
'When sep = "1", separate() splits at every appearance of the character "1". This happens because R treats a 1 surrounded by quotation marks as a character string, not a number.'

4.2.2.8 Convert

You may have noticed that separate() returns its results as columns of character strings. However, in some cases, like ours, the columns will contain integers, doubles, or other types of non-character data.

You can ask separate() to convert the new columns to an appropriate data type by adding convert = TRUE to your separate() call. This is identical to the convert = TRUE argument of gather().
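If you would like to check the column types programmatically rather than by eye, a sketch along these lines may help. The tibble dates is a hypothetical two-row stand-in for hurricanes.

```r
library(tidyr)

# Hypothetical two-row stand-in for the date column of hurricanes
dates <- tibble::tibble(date = c("2000-08-03", "1998-07-27"))

# Without convert, the new columns stay character strings
as_text <- separate(dates, col = date, into = c("year", "month", "day"))

# With convert = TRUE, separate() re-types columns that look like numbers
as_ints <- separate(dates, col = date, into = c("year", "month", "day"),
                    convert = TRUE)

# Compare the class of every column in each result
sapply(as_text, class)
sapply(as_ints, class)
```

One side effect to keep in mind: converting to integers drops leading zeros, so "08" becomes 8.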

Identify the data types of year, month, and day (they appear under the column names) in the output below. Then add convert = TRUE and re-run the code. What changes?

hurricanes %>% 
  separate(col = date, into = c("year","month","day"), convert = TRUE)
## # A tibble: 6 × 6
##   name    wind_speed pressure  year month   day
##   <chr>        <dbl>    <dbl> <int> <int> <int>
## 1 Alberto        110     1007  2000     8     3
## 2 Alex            45     1009  1998     7    27
## 3 Allison         65     1005  1995     6     3
## 4 Ana             40     1013  1997     6    30
## 5 Arlene          50     1010  1999     6    11
## 6 Arthur          45     1010  1996     6    17

4.2.2.9 Remove

Let’s take a look at one last argument for separate(). If you add remove = FALSE to your separate() call, R will retain the original column in the results.

hurricanes %>% 
  separate(col = date, into = c("year","month","day"), convert = TRUE, remove = FALSE)
## # A tibble: 6 × 7
##   name    wind_speed pressure date        year month   day
##   <chr>        <dbl>    <dbl> <chr>      <int> <int> <int>
## 1 Alberto        110     1007 2000-08-03  2000     8     3
## 2 Alex            45     1009 1998-07-27  1998     7    27
## 3 Allison         65     1005 1995-06-03  1995     6     3
## 4 Ana             40     1013 1997-06-30  1997     6    30
## 5 Arlene          50     1010 1999-06-11  1999     6    11
## 6 Arthur          45     1010 1996-06-17  1996     6    17

4.2.3 unite()

4.2.3.1 unite()

You can do the inverse of separate() with unite(). unite() uses multiple input columns to create a single output column. It builds this column by pasting together the cells of the input columns with a separator.

hurricanes %>%
  separate(date, c("year", "month", "day"), sep = "-") %>%
  unite(col = "date", month, day, year, sep = ":")
## # A tibble: 6 × 4
##   name    wind_speed pressure date      
##   <chr>        <dbl>    <dbl> <chr>     
## 1 Alberto        110     1007 08:03:2000
## 2 Alex            45     1009 07:27:1998
## 3 Allison         65     1005 06:03:1995
## 4 Ana             40     1013 06:30:1997
## 5 Arlene          50     1010 06:11:1999
## 6 Arthur          45     1010 06:17:1996

4.2.3.2 Syntax

hurricanes %>%
  separate(date, c("year", "month", "day"), sep = "-") %>%
  unite(col = "date", month, day, year, sep = ":")

Notice that the syntax of unite() is the inverse of separate():

  • The first argument is a character string: the name of the new column that unite() will make
  • The arguments that follow are the columns to be combined into the new column. You can list as many columns as you like, their names do not need to be in quotes, and each name is listed as its own argument.
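One way to convince yourself that unite() undoes separate() is a round trip: split a column, paste it back with the same separator, and compare. The sketch below assumes a hypothetical two-row tibble named hurricanes_small.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row stand-in for hurricanes
hurricanes_small <- tibble::tibble(
  name = c("Alberto", "Alex"),
  date = c("2000-08-03", "1998-07-27")
)

# Split date into three columns, then paste them back together with "-"
round_trip <- hurricanes_small %>%
  separate(col = date, into = c("year", "month", "day"), sep = "-") %>%
  unite(col = "date", year, month, day, sep = "-")

# TRUE if the rebuilt dates match the originals
identical(round_trip$date, hurricanes_small$date)
```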

4.2.3.3 Exercise - Separate and Unite

Use separate() and unite() to rewrite the dates in hurricanes in the format below:

  • month/day/year, e.g., 1/27/2020
hurricanes %>% separate(date, c("year", "month", "day"), sep = "-") %>%
  unite(col = date, "month", "day", "year", sep = "/")
## # A tibble: 6 × 4
##   name    wind_speed pressure date      
##   <chr>        <dbl>    <dbl> <chr>     
## 1 Alberto        110     1007 08/03/2000
## 2 Alex            45     1009 07/27/1998
## 3 Allison         65     1005 06/03/1995
## 4 Ana             40     1013 06/30/1997
## 5 Arlene          50     1010 06/11/1999
## 6 Arthur          45     1010 06/17/1996
"Good job! Let's push it one step farther."

4.2.3.4 Exercise - Separate and Unite 2

Use the chunk below to:

  1. Use separate() to isolate the first two digits of each date as “century”
  2. Filter the data to just rows where century == 19. These will be storms that occurred in the 1900s.
  3. Use unite() to return the results to the original date format. Hint: you can set sep = “” to avoid including a separator character when uniting.
hurricanes %>% 
  separate(col = date, into = c("century", "rest"), sep = 2) %>%
  filter(century == 19) %>%
  unite(col = "date", century, rest, sep = "")
## # A tibble: 5 × 4
##   name    wind_speed pressure date      
##   <chr>        <dbl>    <dbl> <chr>     
## 1 Alex            45     1009 1998-07-27
## 2 Allison         65     1005 1995-06-03
## 3 Ana             40     1013 1997-06-30
## 4 Arlene          50     1010 1999-06-11
## 5 Arthur          45     1010 1996-06-17
hurricanes %>% 
  separate(col = date, into = c("century", "rest"), sep = 2) %>%
  filter(century == "19") %>%
  unite(col = "date", century, rest, sep = "")
## # A tibble: 5 × 4
##   name    wind_speed pressure date      
##   <chr>        <dbl>    <dbl> <chr>     
## 1 Alex            45     1009 1998-07-27
## 2 Allison         65     1005 1995-06-03
## 3 Ana             40     1013 1997-06-30
## 4 Arlene          50     1010 1999-06-11
## 5 Arthur          45     1010 1996-06-17

4.2.3.5 Tidy data

So far we’ve separated and united date, a variable that contains legitimate sub-variables. This is because it makes little sense to combine unrelated values within the same cells. However, many data sets follow this senseless practice. If you inherit one, you can use separate() and unite() to reorganize the values in a tidy fashion.

In the case study that follows, you will do just that. You will also practice using all of the tidyr functions as you do.

4.2.4 Case study

4.2.4.1 who

The who data set contains a subset of data from the World Health Organization Global Tuberculosis Report, available here.

_Probably: https://extranet.who.int/tme/generateCSV.asp?ds=notifications_

In its original format, the data is very untidy:

who_orig <- read_csv("who/who_TB_notification.csv")
## Rows: 8492 Columns: 177
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): country, iso2, iso3, iso_numeric, g_whoregion
## dbl (172): year, new_sp, new_sn, new_su, new_ep, new_oth, ret_rel, ret_taf, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 8,492 × 177
##    country iso2  iso3  iso_n…¹ g_who…²  year new_sp new_sn new_su new_ep new_oth
##    <chr>   <chr> <chr> <chr>   <chr>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1 Afghan… AF    AFG   004     EMR      1980     NA     NA     NA     NA      NA
##  2 Afghan… AF    AFG   004     EMR      1981     NA     NA     NA     NA      NA
##  3 Afghan… AF    AFG   004     EMR      1982     NA     NA     NA     NA      NA
##  4 Afghan… AF    AFG   004     EMR      1983     NA     NA     NA     NA      NA
##  5 Afghan… AF    AFG   004     EMR      1984     NA     NA     NA     NA      NA
##  6 Afghan… AF    AFG   004     EMR      1985     NA     NA     NA     NA      NA
##  7 Afghan… AF    AFG   004     EMR      1986     NA     NA     NA     NA      NA
##  8 Afghan… AF    AFG   004     EMR      1987     NA     NA     NA     NA      NA
##  9 Afghan… AF    AFG   004     EMR      1988     NA     NA     NA     NA      NA
## 10 Afghan… AF    AFG   004     EMR      1989     NA     NA     NA     NA      NA
## # … with 8,482 more rows, 166 more variables: ret_rel <dbl>, ret_taf <dbl>,
## #   ret_tad <dbl>, ret_oth <dbl>, newret_oth <dbl>, new_labconf <dbl>,
## #   new_clindx <dbl>, ret_rel_labconf <dbl>, ret_rel_clindx <dbl>,
## #   ret_rel_ep <dbl>, ret_nrel <dbl>, notif_foreign <dbl>, c_newinc <dbl>,
## #   new_sp_m04 <dbl>, new_sp_m514 <dbl>, new_sp_m014 <dbl>, new_sp_m1524 <dbl>,
## #   new_sp_m2534 <dbl>, new_sp_m3544 <dbl>, new_sp_m4554 <dbl>,
## #   new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>, new_sp_f04 <dbl>, …
who
## # A tibble: 1,000 × 103
##    country     iso2  iso3   year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
##    <chr>       <chr> <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Afghanistan AF    AFG    1980      NA      NA      NA      NA      NA      NA
##  2 Afghanistan AF    AFG    1981      NA      NA      NA      NA      NA      NA
##  3 Afghanistan AF    AFG    1982      NA      NA      NA      NA      NA      NA
##  4 Afghanistan AF    AFG    1983      NA      NA      NA      NA      NA      NA
##  5 Afghanistan AF    AFG    1984      NA      NA      NA      NA      NA      NA
##  6 Afghanistan AF    AFG    1985      NA      NA      NA      NA      NA      NA
##  7 Afghanistan AF    AFG    1986      NA      NA      NA      NA      NA      NA
##  8 Afghanistan AF    AFG    1987      NA      NA      NA      NA      NA      NA
##  9 Afghanistan AF    AFG    1988      NA      NA      NA      NA      NA      NA
## 10 Afghanistan AF    AFG    1989      NA      NA      NA      NA      NA      NA
## # … with 990 more rows, 93 more variables: new_sp_m65 <dbl>, new_sp_mu <dbl>,
## #   new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sp_fu <dbl>, new_sn_m04 <dbl>,
## #   new_sn_m514 <dbl>, new_sn_m014 <dbl>, new_sn_m1524 <dbl>,
## #   new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, new_sn_m4554 <dbl>,
## #   new_sn_m5564 <dbl>, new_sn_m65 <dbl>, new_sn_m15plus <dbl>, …

4.2.4.2 who variables

The first four columns of who each contain a single variable:

  • country - the name of a country
  • iso2 - a two letter country code
  • iso3 - a three letter country code
  • year - year

The remaining columns are named after codes that contain multiple variables.

4.2.4.3 who codes

Each column name after the fourth contains a code made up of three values from three variables: type of TB, gender, and age.

image
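To make the structure of the codes concrete, here is a sketch that decodes three column names copied from the who output above. The tibble codes and the intermediate column name sexage are choices made for this illustration.

```r
library(dplyr)
library(tidyr)

# Three column names taken from the who output shown earlier
codes <- tibble::tibble(code = c("new_sp_m014", "new_sp_f1524", "new_sn_m65"))

decoded <- codes %>%
  # Split on the underscores, keeping the original code for reference
  separate(col = code, into = c("new", "type", "sexage"), sep = "_",
           remove = FALSE) %>%
  # The first character is the sex; everything after it is the age group
  separate(col = sexage, into = c("sex", "age"), sep = 1)

decoded
```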

4.2.4.4 A goal

To make who easier to use in R, we should tidy it into the format below. This data set contains six non-redundant variables: country, year, type, sex, age (group), and n (the number of cases of TB reported for each group).

## # A tibble: 12,809 × 6
##    country      year type  sex   age       n
##    <chr>       <dbl> <chr> <chr> <chr> <dbl>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # … with 12,799 more rows
who %>% select(-c(iso2, iso3))
## # A tibble: 1,000 × 101
##    country  year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶ new_s…⁷ new_s…⁸
##    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Afghan…  1980      NA      NA      NA      NA      NA      NA      NA      NA
##  2 Afghan…  1981      NA      NA      NA      NA      NA      NA      NA      NA
##  3 Afghan…  1982      NA      NA      NA      NA      NA      NA      NA      NA
##  4 Afghan…  1983      NA      NA      NA      NA      NA      NA      NA      NA
##  5 Afghan…  1984      NA      NA      NA      NA      NA      NA      NA      NA
##  6 Afghan…  1985      NA      NA      NA      NA      NA      NA      NA      NA
##  7 Afghan…  1986      NA      NA      NA      NA      NA      NA      NA      NA
##  8 Afghan…  1987      NA      NA      NA      NA      NA      NA      NA      NA
##  9 Afghan…  1988      NA      NA      NA      NA      NA      NA      NA      NA
## 10 Afghan…  1989      NA      NA      NA      NA      NA      NA      NA      NA
## # … with 990 more rows, 91 more variables: new_sp_f04 <dbl>, new_sp_f514 <dbl>,
## #   new_sp_f014 <dbl>, new_sp_f1524 <dbl>, new_sp_f2534 <dbl>,
## #   new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
## #   new_sp_f65 <dbl>, new_sp_fu <dbl>, new_sn_m04 <dbl>, new_sn_m514 <dbl>,
## #   new_sn_m014 <dbl>, new_sn_m1524 <dbl>, new_sn_m2534 <dbl>,
## #   new_sn_m3544 <dbl>, new_sn_m4554 <dbl>, new_sn_m5564 <dbl>,
## #   new_sn_m65 <dbl>, new_sn_m15plus <dbl>, new_sn_mu <dbl>, …

4.2.4.5 A strategy

Next, we need to move the type, sex, and age variables out of the column names and into a column of their own. It is true that we want to separate these values into their own cells, but that will be easier to do once they are in their own column.

In short, we want to do something like this:

image

4.2.4.6 Exercise - Reshape

Add to the pipe below. Use a tidyr reshaping function to gather the column names into their own column, named “codes”. Place the column cells into a column named “n”. Hint: it may be helpful to know that there are now 101 columns in the data set.

You can think of each column name as a key that combines the values of several variables. We want to move those keys into their own key column.
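The same reshaping is easier to follow on a miniature table. The sketch below uses who_mini, a hypothetical data set with invented values and only two code columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of who: two code columns, invented counts
who_mini <- tibble::tibble(
  country = c("A", "B"),
  year = c(1999, 1999),
  new_sp_m014 = c(2, NA),
  new_sp_f014 = c(5, 7)
)

# Move the column names into "codes" and the cell values into "n"
long <- who_mini %>%
  gather(key = "codes", value = "n", new_sp_m014:new_sp_f014)

long
```

Each of the two code columns contributes one row per original row, so the result has four rows. In recent versions of tidyr, pivot_longer() performs the same kind of reshaping, although it returns the rows in a different order.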

who %>%
  select(-iso2, -iso3) %>% 
  gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) 
## # A tibble: 99,000 × 4
##    country      year codes           n
##    <chr>       <dbl> <chr>       <dbl>
##  1 Afghanistan  1980 new_sp_m014    NA
##  2 Afghanistan  1981 new_sp_m014    NA
##  3 Afghanistan  1982 new_sp_m014    NA
##  4 Afghanistan  1983 new_sp_m014    NA
##  5 Afghanistan  1984 new_sp_m014    NA
##  6 Afghanistan  1985 new_sp_m014    NA
##  7 Afghanistan  1986 new_sp_m014    NA
##  8 Afghanistan  1987 new_sp_m014    NA
##  9 Afghanistan  1988 new_sp_m014    NA
## 10 Afghanistan  1989 new_sp_m014    NA
## # … with 98,990 more rows

4.2.4.7 Exercise - Separate again

Our last separate() call isolated two components of the who codes: new and type. However, it did not separate the sex and age variables.

If you look closely at the structure of the sexage column, you will see that each cell begins with a single letter that represents a gender, m or f, and is then followed by a short code, usually two to four digits, that represents an age group. Use this insight to perform a second separate that isolates the “sex” and “age” variables:

who %>%
  select(-iso2, -iso3) %>% 
  gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) %>% 
  separate(codes, into = c("new", "type", "sexage"), sep = "_")
## # A tibble: 99,000 × 6
##    country      year new   type  sexage     n
##    <chr>       <dbl> <chr> <chr> <chr>  <dbl>
##  1 Afghanistan  1980 new   sp    m014      NA
##  2 Afghanistan  1981 new   sp    m014      NA
##  3 Afghanistan  1982 new   sp    m014      NA
##  4 Afghanistan  1983 new   sp    m014      NA
##  5 Afghanistan  1984 new   sp    m014      NA
##  6 Afghanistan  1985 new   sp    m014      NA
##  7 Afghanistan  1986 new   sp    m014      NA
##  8 Afghanistan  1987 new   sp    m014      NA
##  9 Afghanistan  1988 new   sp    m014      NA
## 10 Afghanistan  1989 new   sp    m014      NA
## # … with 98,990 more rows

4.2.4.8 Exercise - Select

Add to the pipe to remove the new variable, which doesn’t provide any useful information. (Every row in the data set shows new cases of TB and has the same value of new).

who %>%
  select(-iso2, -iso3) %>% 
  gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) %>% 
  separate(codes, into = c("new", "type", "sexage"), sep = "_") %>% 
  separate(sexage, into = c("sex", "age"), sep = 1)
## # A tibble: 99,000 × 7
##    country      year new   type  sex   age       n
##    <chr>       <dbl> <chr> <chr> <chr> <chr> <dbl>
##  1 Afghanistan  1980 new   sp    m     014      NA
##  2 Afghanistan  1981 new   sp    m     014      NA
##  3 Afghanistan  1982 new   sp    m     014      NA
##  4 Afghanistan  1983 new   sp    m     014      NA
##  5 Afghanistan  1984 new   sp    m     014      NA
##  6 Afghanistan  1985 new   sp    m     014      NA
##  7 Afghanistan  1986 new   sp    m     014      NA
##  8 Afghanistan  1987 new   sp    m     014      NA
##  9 Afghanistan  1988 new   sp    m     014      NA
## 10 Afghanistan  1989 new   sp    m     014      NA
## # … with 98,990 more rows

4.2.4.9 n

Notice that the n column of who contains the most insightful information. You do not need to take any measurements to list out the country, year, type, sex, and age combinations in the data set. In a sense, you know these combinations in advance. However, n shows how many cases of TB were reported for each combination. You do not know this information in advance, and you can only acquire it through field work, whether yours or someone else’s. As a result, it is concerning that our data contains so many NAs for n.

4.2.4.10 NA

NA is R’s symbol for missing information, and it is common to have multiple NAs when you reshape your data from a wide format to a long format. The rectangular table structure imposed by wide data requires a placeholder for every combination of variable values, even if no data was collected for that combination.

In contrast, the long data format does not require a placeholder for each combination of variable values. Since each combination is saved as its own row, you can simply omit the rows that contain an NA.

4.2.4.11 drop_na()

The tidyr package provides a convenient function for dropping rows that contain an NA in a specific column. The function is drop_na(). To use it, give drop_na() a data set (perhaps via a pipe), then list one or more columns in that data set, e.g.

data %>% drop_na(column1, column2)

drop_na() will drop every row that contains an NA in one or more of the listed columns.
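Here is a quick sketch of how the listed columns change what drop_na() removes. The tibble cases and its values are invented for illustration.

```r
library(tidyr)

# Hypothetical data with one NA in n and one NA in year
cases <- tibble::tibble(
  country = c("A", "A", "B", "B"),
  year = c(1999, 2000, NA, 2000),
  n = c(NA, 12, 7, 3)
)

# Only rows with an NA in n are dropped (the NA in year survives)
by_n <- drop_na(cases, n)

# Rows with an NA in either listed column are dropped
by_both <- drop_na(cases, year, n)

# With no columns listed, drop_na() checks every column
by_all <- drop_na(cases)
```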

Add drop_na() to the pipe below to drop every row that has an NA in the n column.

who %>%
  select(-iso2, -iso3) %>% 
  gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) %>% 
  separate(codes, into = c("new", "type", "sexage"), sep = "_") %>% 
  separate(sexage, into = c("sex", "age"), sep = 1) %>% 
  select(-new)
## # A tibble: 99,000 × 6
##    country      year type  sex   age       n
##    <chr>       <dbl> <chr> <chr> <chr> <dbl>
##  1 Afghanistan  1980 sp    m     014      NA
##  2 Afghanistan  1981 sp    m     014      NA
##  3 Afghanistan  1982 sp    m     014      NA
##  4 Afghanistan  1983 sp    m     014      NA
##  5 Afghanistan  1984 sp    m     014      NA
##  6 Afghanistan  1985 sp    m     014      NA
##  7 Afghanistan  1986 sp    m     014      NA
##  8 Afghanistan  1987 sp    m     014      NA
##  9 Afghanistan  1988 sp    m     014      NA
## 10 Afghanistan  1989 sp    m     014      NA
## # … with 98,990 more rows
who %>%
  select(-iso2, -iso3) %>% 
  gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) %>% 
  separate(codes, into = c("new", "type", "sexage"), sep = "_") %>% 
  separate(sexage, into = c("sex", "age"), sep = 1) %>% 
  select(-new) %>%
  drop_na(n)
## # A tibble: 12,809 × 6
##    country      year type  sex   age       n
##    <chr>       <dbl> <chr> <chr> <chr> <dbl>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # … with 12,799 more rows

4.2.4.12 Recap

Good job! You’ve wrangled who into a tidy, polished data set that is ready to be explored, modelled, and analyzed.

The difference between the initial and final versions of who is drastic, but each step in our pipe imposed a small, logical change. This is by design.

The tidyverse contains a vocabulary of functions that each do one simple thing, but can be combined to do more sophisticated tasks. In this way, the tidyverse is like a written language: it is made up of words (functions) that can be combined into sentences (pipes) that have a sophisticated meaning.

This approach also makes it easier to solve problems with code. You can approach any problem by decomposing it into a series of small, simple steps.

4.3 Join Data Sets

Complete your data wrangling education by learning to work with relational data. Here you will learn how to augment data sets with information from related data sets, as well as how to filter one data set against another.

4.3.1 Welcome

Data often comes as multiple data sets that are related to each other. When this happens, the data will be easier to analyze if you join the data sets into a single table. This tutorial will teach you several functions that join data sets together. These functions do something sophisticated: they match rows from one data set to corresponding rows in another data set, even if the rows appear in a different order. The functions are:

  • left_join(), right_join(), full_join(), and inner_join() - which augment a copy of one data frame with information from a second
  • semi_join() and anti_join() - which filter the contents of one data frame against the contents of a second
  • bind_rows(), bind_cols(), and set operations - which combine data sets in simpler ways

Each of these functions comes in the dplyr package, not the tidyr package. You may wonder why we are learning about them in the Tidy Data primer. Joins are a useful component of data tidying; your data can hardly be tidy if observations are split across multiple data frames where they are listed in different orders.

This tutorial uses the core tidyverse packages, including dplyr, as well as the nycflights13 package. All of these packages have been pre-installed and pre-loaded for your convenience.

Click the Next Topic button to begin.

4.3.2 Mutating Joins

4.3.2.1 Which airlines have the largest arrival delays?

Flight delays are an unfortunate aspect of air travel. If you’ve flown more than a handful of times, you’ve probably experienced a delayed flight, which may make you wonder: is it possible to predict which flights will be delayed?

The flights data set in the nycflights13 package provides some relevant information. It contains details of every flight that departed from an airport that serves New York City in 2013. Let’s use it to explore which airlines have the largest flight delays.

flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

4.3.2.2 Review - Which airlines have the largest arrival delays?

The carrier variable of flights uses a carrier code to identify which airline operated each flight. This gives us a strategy for comparing the average delay time by airline:

  1. Preemptively drop all of the rows that have an NA in arr_delay, which is the variable that records how delayed each flight was when it arrived at its destination (flights with a negative arr_delay arrived early).
  2. Group the data by carrier
  3. Calculate avg_delay: the average delay per carrier group
  4. Arrange the carrier groups in descending order according to their avg_delay scores. The carriers with the largest average delays will appear at the top of the list.

Use dplyr functions in the code chunk below to enact this strategy. Which airlines have the largest average delays?

flights %>% 
  drop_na(arr_delay) %>% 
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(desc(avg_delay))
## # A tibble: 16 × 2
##    carrier avg_delay
##    <chr>       <dbl>
##  1 F9         21.9  
##  2 FL         20.1  
##  3 EV         15.8  
##  4 YV         15.6  
##  5 OO         11.9  
##  6 MQ         10.8  
##  7 WN          9.65 
##  8 B6          9.46 
##  9 9E          7.38 
## 10 UA          3.56 
## 11 US          2.13 
## 12 VX          1.76 
## 13 DL          1.64 
## 14 AA          0.364
## 15 HA         -6.92 
## 16 AS         -9.93
"Good job! You've calculated the average delay per airline, but the results are difficult to interpret. We don't know which codes are associated with which airlines."

4.3.2.3 airlines

Our results show that the carrier F9 had the worst record for delays in the New York City area in 2013. But unless you are an air traffic controller, you probably do not know which airline has the carrier code F9.

Luckily, the nycflights13 package comes with another data set, airlines, which matches the name of each airline to its carrier code.

airlines
## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

4.3.2.4 A join

While you could look up F9 manually in airlines, and then repeat that process for every other code, the task would not be enjoyable. Your boss or your client will probably not be as willing as you to do it.

A better solution would be to join the airlines data set to your results programmatically. In other words, you can instruct R to add the name that is associated with each carrier code in airlines to the row that is associated with each carrier code in your results.

This is easy to do with one of dplyr’s four join functions: left_join(), right_join(), full_join(), and inner_join(). Each performs a variation of the basic task above.
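As a preview, here is a minimal sketch of such a join. It assumes two small hand-typed tibbles, delays and airline_names, that copy the top three rows of the results above and the matching rows of airlines, so it runs without the nycflights13 package.

```r
library(dplyr)

# Top three rows of the average-delay results shown above
delays <- tibble::tibble(
  carrier = c("F9", "FL", "EV"),
  avg_delay = c(21.9, 20.1, 15.8)
)

# Matching rows of airlines, deliberately listed in a different order
airline_names <- tibble::tibble(
  carrier = c("EV", "F9", "FL"),
  name = c("ExpressJet Airlines Inc.", "Frontier Airlines Inc.",
           "AirTran Airways Corporation")
)

# Match rows on the shared carrier column and add the airline name
named_delays <- delays %>%
  left_join(airline_names, by = "carrier")

named_delays
```

The rows of the result follow the order of delays; the join finds each matching name even though airline_names lists the carriers differently.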

4.3.2.5 Toy data

The easiest way to learn how join functions work is visually. To this end, I’ve created some small toy data sets that we can visualize in their entirety: band and instrument, which look like this:

image

Notice that each data set has a column named name. Also, notice that each data set contains a row about John and a row about Paul. If you know a little about The Beatles, you’ll recognize that these rows match: they describe the same people. On the other hand, the rows named Mick and Keith do not match any rows in the other data set. Finally, notice that the matching rows do not appear in the same place in each data set. For example, John is in the second row of band, but the first row of instrument.

These small data sets do a good job of matching the haphazard nature of real data. Our job will be to join them into a single data set that correctly matches the John and Paul rows to each other.

If you wish to see the raw data in band and instrument, take a peek by running the code below.

band
## # A tibble: 3 × 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
instrument
## # A tibble: 3 × 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
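If you want to follow along outside of the tutorial, you could rebuild the two toy data sets by hand. The sketch below copies the values printed above; the use of tribble() is an assumption about convenience, not necessarily how the originals were made. (dplyr also ships similar example data sets named band_members and band_instruments.)

```r
library(tibble)

# Rebuild band from the printed output above
band <- tribble(
  ~name,  ~band,
  "Mick", "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

# Rebuild instrument from the printed output above
instrument <- tribble(
  ~name,   ~plays,
  "John",  "guitar",
  "Paul",  "bass",
  "Keith", "guitar"
)

# Names that appear in both tables are the rows a join can match
shared <- intersect(band$name, instrument$name)
shared
```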

4.3.2.6 left_join()

Let’s look at each dplyr join function and then deconstruct their syntax.

The left_join() function returns a copy of a data set that is augmented with information from a second data set. It retains all of the rows of the first data set, and only adds rows from the second data set that match rows in the first.

So here, Mick is retained in the result (with an NA in the appropriate spot) because Mick appears in the first data set. On the other hand, Keith does not appear in the result because Keith does not appear in the first data set.

image

To see what this result looks like in R, run the code below.

band %>% left_join(instrument, by = "name")
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

4.3.2.7 right_join()

right_join() does the opposite of left_join(); it retains every row from the second data set and only adds rows from the first data set that have a match in the second data set. Now Keith appears in the result because Keith appears in the second data set. On the other hand, Mick does not appear in the result because he does not appear in the second data set.

image

You can think of left_join() as prioritizing the first data set, and right_join() as prioritizing the second. To see the results in R, run the code below.

band %>% right_join(instrument, by = "name")
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar

4.3.2.8 Test your comprehension

How can you swap the names in the code below to attain the results pictured in the right join diagram? (Don’t worry about the order of the columns in the result.)

image

band %>% left_join(instrument, by = "name")
band %>% right_join(instrument, by = "name")
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar
"Good Job! Since right and left joins are analogous, you can achieve the same results by switching the order of the data sets in a left join. Notice that this will affect the column order."

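As a sketch of the swap described in the feedback, the code below puts instrument first in a left join. It recreates the toy data sets from the printed output earlier in this topic, so it assumes only that dplyr and tibble are installed. The name swapped is my own, chosen for illustration.

```r
library(dplyr)
library(tibble)

# Toy data sets, copied from the printed output earlier in this topic
band <- tribble(
  ~name,  ~band,
  "Mick", "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

instrument <- tribble(
  ~name,   ~plays,
  "John",  "guitar",
  "Paul",  "bass",
  "Keith", "guitar"
)

# Same rows as band %>% right_join(instrument, by = "name"),
# but the columns from instrument now come first
swapped <- instrument %>% left_join(band, by = "name")
swapped

# Reordering the columns recovers the column order of the right join
swapped %>% select(name, band, plays)
```
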
4.3.2.9 full_join()

A full_join() is more inclusive than either a right_join() or a left_join(). A full_join() retains every row from each data set, inserting NA placeholders throughout the results as necessary.

This is the only join that does not lose any information from the original data sets. Both Mick and Keith appear in the results.

image

To see what this result looks like in R, run the code below.

band %>% full_join(instrument, by = "name")
## # A tibble: 4 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

4.3.2.10 inner_join()

In contrast, an inner_join() is the most exclusive join. It only retains rows that appear in both data sets. As a result, only John and Paul appear in the result. Mick and Keith are left behind.

image

To see what this result looks like in R, run the code below.

band %>% inner_join(instrument, by = "name")
## # A tibble: 2 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass

4.3.2.11 Mutating join syntax

These four joins, left_join(), right_join(), full_join(), and inner_join(), are called mutating joins because they each return a copy of a data set that has been augmented with new information, just as mutate() returns a copy of a data set that has been augmented with new information.

Each function uses the same syntax:

 left_join(band, instrument, by = "name")
right_join(band, instrument, by = "name")
 full_join(band, instrument, by = "name")
inner_join(band, instrument, by = "name")

First, pass the function the names of two data sets to join.

Then set the by argument to the name or names of the column or columns to join on. These names should be passed as a vector of character strings, i.e. characters surrounded by quotes. In the code above, we join on a single column so our vector of strings simplifies to a single string, but you could imagine doing something like left_join(band, instrument, by = c("first", "last")).

Each column name in by should appear in both data sets. The join function will match together rows that have identical combinations of values in the columns listed in by. If you do not specify a by argument, dplyr will join on the set of all column names that appear in both data sets.
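
Here is a small sketch of a join on two columns. The data sets people and roles are hypothetical; I made them up so that a single column would not be enough to identify a row. The sketch assumes that dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical data sets that share two key columns, first and last
people <- tribble(
  ~first, ~last,       ~band,
  "John", "Lennon",    "Beatles",
  "Paul", "McCartney", "Beatles",
  "Paul", "Simon",     "Simon & Garfunkel"
)

roles <- tribble(
  ~first, ~last,       ~plays,
  "John", "Lennon",    "guitar",
  "Paul", "McCartney", "bass"
)

# Rows match only when both first and last are identical
joined <- people %>% left_join(roles, by = c("first", "last"))
joined

# With by omitted, dplyr joins on every shared column name
# (here first and last) and prints a message that says so
people %>% left_join(roles)
```

Notice that Paul Simon receives an NA for plays: his first name matches a row in roles, but the combination of first and last does not.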

4.3.2.12 Exercise - Which airlines?

Now that you’ve familiarized yourself with the mutating join functions, let’s use one to finish our airlines query. Add two more lines to the code below.

  1. In the first, join the results to airlines in a way that keeps every row of the results, but only the matching rows of airlines.
  2. In the second, select just the name and avg_delay columns in that order.

flights %>% 
  drop_na(arr_delay) %>% 
  group_by(carrier) %>% 
  summarise(avg_delay = mean(arr_delay)) %>% 
  arrange(desc(avg_delay)) %>%
  left_join(airlines, by = "carrier") %>%
  select(name, avg_delay)
## # A tibble: 16 × 2
##    name                        avg_delay
##    <chr>                           <dbl>
##  1 Frontier Airlines Inc.         21.9  
##  2 AirTran Airways Corporation    20.1  
##  3 ExpressJet Airlines Inc.       15.8  
##  4 Mesa Airlines Inc.             15.6  
##  5 SkyWest Airlines Inc.          11.9  
##  6 Envoy Air                      10.8  
##  7 Southwest Airlines Co.          9.65 
##  8 JetBlue Airways                 9.46 
##  9 Endeavor Air Inc.               7.38 
## 10 United Air Lines Inc.           3.56 
## 11 US Airways Inc.                 2.13 
## 12 Virgin America                  1.76 
## 13 Delta Air Lines Inc.            1.64 
## 14 American Airlines Inc.          0.364
## 15 Hawaiian Airlines Inc.         -6.92 
## 16 Alaska Airlines Inc.           -9.93

4.3.2.13 nycflights13 data sets

airlines is not the only data set in nycflights13 that expands upon flights. nycflights13 contains a total of five data sets that each focus on a related aspect of air travel.

  1. flights - describes each flight that departed from a New York City airport (i.e. Newark, La Guardia, or JFK)
  2. airports - describes major airports in the US, including their FAA codes and names
  3. planes - describes the individual airplanes, identified by their tail numbers
  4. weather - describes the hourly weather conditions for each NYC airport
  5. airlines - lists the carrier codes and names for each airline

The diagram below lists the column names for each data set. You can see that each data set shares one or more common columns with flights. Let’s use one to answer a new query.

image

Which airports have the largest arrival delays?

Let’s repeat our last investigation to see which destinations have the largest average arrival delays. By swapping carrier with dest we arrive at:

flights %>% 
  drop_na(arr_delay) %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(desc(avg_delay))
## # A tibble: 104 × 2
##    dest  avg_delay
##    <chr>     <dbl>
##  1 CAE        41.8
##  2 TUL        33.7
##  3 OKC        30.6
##  4 JAC        28.1
##  5 TYS        24.1
##  6 MSN        20.2
##  7 RIC        20.1
##  8 CAK        19.7
##  9 DSM        19.0
## 10 GRR        18.2
## # … with 94 more rows

But we face a similar problem. How can we replace the dest codes with names?

4.3.2.14 airports

Luckily, the airports data set shows the names associated with each code. But look closely at airports:

airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # … with 1,448 more rows

Which variable name does airports use for the airport codes?

  • dest ✗
  • origin ✗
  • name ✗
  • faa ✓
Correct!

This makes it difficult to join these two data sets because flights and airports use different column names for the codes columns (dest and faa).

4.3.2.15 Different column names

airports and flights share a common variable, airport codes, but save the variable under different column names: faa in airports and dest in flights. This is a common occurrence with data. We can recreate it by making a second instrument data set that replaces the first column name with artist.

instrument2
## # A tibble: 3 × 2
##   artist plays 
##   <chr>  <chr> 
## 1 John   guitar
## 2 Paul   bass  
## 3 Keith  guitar

We can still join band to instrument2, but we will need to tell R to match the name column to the artist column. To do this, you will need to know a little about how to name the elements of a vector.

image

4.3.2.16 Named vectors

A named vector is a vector whose elements have been given names. To create a named vector, simply assign names to each element of the vector when you create the vector with c().

For example, this creates an unnamed vector:

c(1, 2, 3)
## [1] 1 2 3

And this creates a named vector. Here the first element is named “uno”, the second is named “dos”, and so on.

c(uno = 1, dos = 2, tres = 3)
##  uno  dos tres 
##    1    2    3

If you like, you can place quotes around the names when you make the vector, like c("uno" = 1, "dos" = 2, "tres" = 3). You’ll see me do that in the next section to make things look symmetric.

Named vectors are a basic feature of R. Let’s look at how we can use them to solve our join problem.
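
Before we do, here is a short sketch that takes a named character vector apart with base R’s names() and unname(). The object name by_spec is hypothetical; I chose it only for this illustration.

```r
# A named character vector: each element has a name and a value
by_spec <- c("name" = "artist")

names(by_spec)   # the names of the elements
unname(by_spec)  # the values, with the names stripped away

# Quoting the names is optional; both forms create the same vector
same_vector <- identical(c(uno = 1, dos = 2), c("uno" = 1, "dos" = 2))
same_vector
```

So a named vector of strings carries two pieces of text per element, which is enough to describe a pair of column names.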

4.3.2.17 Matching column names

To match on columns with different names, change the by argument of your join function from a vector of character strings to a named vector of character strings.

band %>% left_join(instrument2, by = c("name" = "artist"))
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

R will match the column in the first data set whose name is given by the element’s name (here “name”) with the column in the second data set whose name is given by the element’s value (here “artist”).

image

To see what the result looks like in R, run the code below.

band %>% left_join(instrument2, by = c("name" = "artist"))
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

4.3.2.18 Two details

You can use this syntax to describe multiple pairs of columns. For example,

foo %>% left_join(foo2, by = c("first" = "artist1", "last" = "artist2"))

Technically, you do not need to surround the names of the vector with quotes. This would work.

foo %>% left_join(foo2, by = c(first = "artist1", last = "artist2"))

But you do need to use quotes in the elements of the vector, which are character strings. I like to use quotes on both sides of the = for parity.
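
To make the two-pair case concrete, here is a runnable sketch. The contents of foo and foo2 are invented for this illustration, since the code above only names them. The sketch assumes that dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical data sets: foo and foo2 store the same keys under different names
foo <- tribble(
  ~first, ~last,    ~band,
  "John", "Lennon", "Beatles",
  "Mick", "Jagger", "Stones"
)

foo2 <- tribble(
  ~artist1, ~artist2,   ~plays,
  "John",   "Lennon",   "guitar",
  "Keith",  "Richards", "guitar"
)

# first is matched to artist1, and last is matched to artist2
result <- foo %>%
  left_join(foo2, by = c("first" = "artist1", "last" = "artist2"))
result
```

The key columns in the result keep the names used in the first data set, first and last.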

4.3.2.19 Exercise - Which airports have the largest arrival delays?

Complete our code below to show the name of each destination paired with its average arrival delay.

flights %>% 
  drop_na(arr_delay) %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(desc(avg_delay))
flights %>% 
  drop_na(arr_delay) %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(desc(avg_delay)) %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  select(name, avg_delay)
## # A tibble: 104 × 2
##    name                          avg_delay
##    <chr>                             <dbl>
##  1 Columbia Metropolitan              41.8
##  2 Tulsa Intl                         33.7
##  3 Will Rogers World                  30.6
##  4 Jackson Hole Airport               28.1
##  5 Mc Ghee Tyson                      24.1
##  6 Dane Co Rgnl Truax Fld             20.2
##  7 Richmond Intl                      20.1
##  8 Akron Canton Regional Airport      19.7
##  9 Des Moines Intl                    19.0
## 10 Gerald R Ford Intl                 18.2
## # … with 94 more rows
"Good Job! Flights from NYC to Columbia, South Carolina seem to have arrived particularly late in 2013. At the other end of the list, far-off destinations in Alaska and Hawaii tended to arrive ahead of schedule."

4.3.2.20 Mutating Joins Recap

The four join functions cover all of the ways you can combine information from one data set with another data set.

  • left_join() - joins relevant data from the second data set to the first
  • right_join() - joins relevant data from the first data set to the second
  • full_join() - retains all available data
  • inner_join() - retains only observations that appear in both data sets

If you wish to combine more than two data sets, you can run the joins sequentially, first joining two data sets, then joining the result to a third, and so on. This process is easy to automate with the reduce() function in the purrr package.
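
As a rough sketch of that idea, the code below joins three data sets with reduce(). The third data set, born, is hypothetical; I added it only so there would be something more to join. The sketch assumes that dplyr, purrr, and tibble are installed.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy data sets, copied from the printed output earlier in this topic
band <- tribble(
  ~name,  ~band,
  "Mick", "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

instrument <- tribble(
  ~name,   ~plays,
  "John",  "guitar",
  "Paul",  "bass",
  "Keith", "guitar"
)

# Hypothetical third data set, invented for this illustration
born <- tribble(
  ~name,  ~year,
  "John", 1940,
  "Paul", 1942,
  "Mick", 1943
)

# reduce() joins the first two data sets, then joins that result to the third
combined <- list(band, instrument, born) %>%
  reduce(function(x, y) full_join(x, y, by = "name"))
combined
```

I wrapped full_join() in a small anonymous function so that the by argument is stated explicitly for every step.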

The next topic will look at a group of joins that do something surprisingly different.

4.3.3 Filtering Joins

4.3.3.1 Destinations

Let’s look more closely at the destinations of flights from New York City.

To do this we will use a new type of join: a filtering join. Filtering joins are different from mutating joins in that they do not add new data to a data set. Instead, they filter the rows of a data set based on whether or not the rows match rows in a second data set.

dplyr comes with two filtering join functions:

  • semi_join()
  • anti_join()

Both follow the same syntax as the mutating joins.

4.3.3.2 semi_join()

semi_join() returns every row in the first data set that has a match in the second data set. So, for example, here semi_join() returns the John and Paul rows of band. Notice that semi_join() has not added anything to those rows.

image

To see what the results look like in R, run the code below.

band %>% semi_join(instrument, by = "name")
## # A tibble: 2 × 2
##   name  band   
##   <chr> <chr>  
## 1 John  Beatles
## 2 Paul  Beatles

4.3.3.3 anti_join()

anti_join() does just the opposite of semi_join(); it returns all of the rows in the first data set that do not have a match in the second data set.

image

band %>% anti_join(instrument, by = "name")
## # A tibble: 1 × 2
##   name  band  
##   <chr> <chr> 
## 1 Mick  Stones

4.3.3.4 distinct()

We will also use a new function that comes in dplyr: distinct(). distinct() isn’t a join function, but it is incredibly useful. distinct() returns the distinct values of a column.

instrument %>% distinct(plays)
## # A tibble: 2 × 1
##   plays 
##   <chr> 
## 1 guitar
## 2 bass

image

If you do not supply a column, distinct() returns the distinct rows of the data frame, removing duplicates.
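
Here is a quick sketch of both behaviors side by side. The data set played is hypothetical; it is just a small table with one repeated row. The sketch assumes that dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical data set with one repeated row
played <- tribble(
  ~name,  ~plays,
  "John", "guitar",
  "Paul", "bass",
  "John", "guitar"
)

# No column supplied: duplicate rows are removed, all columns are kept
unique_rows <- played %>% distinct()
unique_rows

# One column supplied: only the distinct values of that column are returned
unique_plays <- played %>% distinct(plays)
unique_plays
```
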

Now let’s put these three functions to work.

4.3.3.5 How many airports does New York connect to?

Use distinct() below to determine how many airports New York City connects to. This will be the number of distinct destinations in the flights data set. First create a data set with these destinations, then look for the number of rows in the data (it appears beneath the table in the results).

flights %>% 
  distinct(dest)
## # A tibble: 105 × 1
##    dest 
##    <chr>
##  1 IAH  
##  2 MIA  
##  3 BQN  
##  4 ATL  
##  5 ORD  
##  6 FLL  
##  7 IAD  
##  8 MCO  
##  9 PBI  
## 10 TPA  
## # … with 95 more rows

4.3.3.6 Exercise - Replace codes with names

Now let’s replace these codes with recognizable names. Add to the code below to left join our results to airports. Remember that the two data sets use different column names. Then select just the name column.

flights %>% 
  distinct(dest) 
flights %>% 
  distinct(dest) %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  select(name)
## # A tibble: 105 × 1
##    name                           
##    <chr>                          
##  1 George Bush Intercontinental   
##  2 Miami Intl                     
##  3 <NA>                           
##  4 Hartsfield Jackson Atlanta Intl
##  5 Chicago Ohare Intl             
##  6 Fort Lauderdale Hollywood Intl 
##  7 Washington Dulles Intl         
##  8 Orlando Intl                   
##  9 Palm Beach Intl                
## 10 Tampa Intl                     
## # … with 95 more rows

4.3.3.7 NAs

Rolling back our results just a bit, you can see that some codes did not have a match in the airports data set.

flights %>% 
  distinct(dest) %>% 
  left_join(airports, by = c("dest" = "faa")) %>% 
  select(dest, name)
## # A tibble: 105 × 2
##    dest  name                           
##    <chr> <chr>                          
##  1 IAH   George Bush Intercontinental   
##  2 MIA   Miami Intl                     
##  3 BQN   <NA>                           
##  4 ATL   Hartsfield Jackson Atlanta Intl
##  5 ORD   Chicago Ohare Intl             
##  6 FLL   Fort Lauderdale Hollywood Intl 
##  7 IAD   Washington Dulles Intl         
##  8 MCO   Orlando Intl                   
##  9 PBI   Palm Beach Intl                
## 10 TPA   Tampa Intl                     
## # … with 95 more rows

4.3.3.8 Which codes did not match?

This is unexpected. It would be useful to see which codes did not have a match. Extend the code below with a filtering join to return just the rows that do not have a match in airports.

flights %>%
  distinct(dest)
flights %>%
  distinct(dest) %>%
  anti_join(airports, by = c("dest" = "faa"))
## # A tibble: 4 × 1
##   dest 
##   <chr>
## 1 BQN  
## 2 SJU  
## 3 STT  
## 4 PSE

4.3.3.9 Double-checking with anti_join()

anti_join() provides an easy way to double-check a join. It shows whether every row that you expect to have a match actually has one.

It’s not uncommon for anti_join() to return values that have a misspelling or typo that prevents the join. Keep in mind that the typo could be in either data set.

Here, these appear to be real airport codes that have been overlooked by airports. We cannot check the names of these four airports because, by definition, they are not in our data set of airport names.
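
For contrast, here is a sketch of the typo case using the toy band data. The data set instrument_typo is hypothetical: it is a copy of instrument in which I misspelled John as Jon on purpose. The sketch assumes that dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Toy data set, copied from the printed output earlier in this tutorial
band <- tribble(
  ~name,  ~band,
  "Mick", "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

# Hypothetical copy of instrument with a deliberately misspelled name ("Jon")
instrument_typo <- tribble(
  ~name,   ~plays,
  "Jon",   "guitar",
  "Paul",  "bass",
  "Keith", "guitar"
)

# Rows of band with no match: Mick is expected, but John hints at a problem
unmatched_band <- band %>% anti_join(instrument_typo, by = "name")
unmatched_band

# Checking in the other direction reveals the misspelled row itself
unmatched_instrument <- instrument_typo %>% anti_join(band, by = "name")
unmatched_instrument
```

Running anti_join() in both directions is a cheap way to see which side holds the suspicious value.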

4.3.3.10 Exercise - How many flights are associated with a known airport name?

Let’s gauge how this affects our data. Use the code chunk below to return all of the flights that do match an airport in airports. Be sure to use a filtering join, not a mutating join.

flights %>%
  semi_join(airports, by = c("dest" = "faa"))
## # A tibble: 329,174 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      554        600      -6     812     837     -25 DL     
##  5  2013     1     1      554        558      -4     740     728      12 UA     
##  6  2013     1     1      555        600      -5     913     854      19 B6     
##  7  2013     1     1      557        600      -3     709     723     -14 EV     
##  8  2013     1     1      557        600      -3     838     846      -8 B6     
##  9  2013     1     1      558        600      -2     753     745       8 AA     
## 10  2013     1     1      558        600      -2     849     851      -2 B6     
## # … with 329,164 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
flights %>%
  anti_join(airports, by = c("dest" = "faa"))
## # A tibble: 7,602 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  2  2013     1     1      615        615       0    1039    1100     -21 B6     
##  3  2013     1     1      628        630      -2    1137    1140      -3 AA     
##  4  2013     1     1      701        700       1    1123    1154     -31 UA     
##  5  2013     1     1      711        715      -4    1151    1206     -15 B6     
##  6  2013     1     1      820        820       0    1254    1310     -16 B6     
##  7  2013     1     1      820        820       0    1249    1329     -40 DL     
##  8  2013     1     1      840        845      -5    1311    1350     -39 AA     
##  9  2013     1     1      909        810      59    1331    1315      16 AA     
## 10  2013     1     1      913        918      -5    1346    1416     -30 UA     
## # … with 7,592 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

4.3.3.11 semi_join() for filtering

How would you write a filter() statement that finds just the flights that:

  1. Departed in January on JetBlue, or
  2. Departed in February on Southwest?

It can be done—as can many other complicated filters. But you may find it easier to perform complicated filters with semi_join() instead of filter().

4.3.3.12 A semi_join() filter 1

For example, you can create a data set that has the combinations you want:

criteria <- tribble(
  ~month, ~carrier,
       1,     "B6", # B6 = JetBlue
       2,     "WN"  # WN = Southwest
)

criteria
## # A tibble: 2 × 2
##   month carrier
##   <dbl> <chr>  
## 1     1 B6     
## 2     2 WN

Then you can run a semi_join() against the data set. Use criteria and semi_join() below to return just the flights that left in January on JetBlue or in February on Southwest.

flights %>%
  semi_join(criteria)
## Joining, by = c("month", "carrier")
## # A tibble: 5,338 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  2  2013     1     1      555        600      -5     913     854      19 B6     
##  3  2013     1     1      557        600      -3     838     846      -8 B6     
##  4  2013     1     1      558        600      -2     849     851      -2 B6     
##  5  2013     1     1      558        600      -2     853     856      -3 B6     
##  6  2013     1     1      559        559       0     702     706      -4 B6     
##  7  2013     1     1      600        600       0     851     858      -7 B6     
##  8  2013     1     1      601        600       1     844     850      -6 B6     
##  9  2013     1     1      613        610       3     925     921       4 B6     
## 10  2013     1     1      615        615       0    1039    1100     -21 B6     
## # … with 5,328 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
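
For comparison, here is a sketch that sets the semi_join() approach next to an equivalent filter() statement. To keep it small and self-contained, it uses toy_flights, a hypothetical five-row stand-in for flights with made-up flight numbers. It assumes that dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for flights with just the columns needed here
toy_flights <- tribble(
  ~month, ~carrier, ~flight,
  1,      "B6",     1,
  1,      "WN",     2,
  2,      "B6",     3,
  2,      "WN",     4,
  3,      "B6",     5
)

criteria <- tribble(
  ~month, ~carrier,
  1,      "B6", # B6 = JetBlue
  2,      "WN"  # WN = Southwest
)

# filter() needs one parenthesized condition for each combination
with_filter <- toy_flights %>%
  filter((month == 1 & carrier == "B6") | (month == 2 & carrier == "WN"))

# semi_join() needs only one more row in criteria for each combination
with_semi <- toy_flights %>%
  semi_join(criteria, by = c("month", "carrier"))

# Both approaches keep the same flights
identical(with_filter$flight, with_semi$flight)
```

The filter() version grows by a full logical clause for each new combination, while the semi_join() version grows by one row of data.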

4.3.3.13 Filtering Joins Recap

Filtering joins filter a data set against the observations in a second data set. They are called joins because they use information from both data sets. However, they use this information to filter—not augment—the original data.

  • semi_join() returns rows that have a match in the second data set. It provides a useful shortcut for complicated filtering.
  • anti_join() returns rows that do not have a match in the second data set. It provides a useful way to check for possible errors in a join.

distinct() is not a join at all, but it does filter data sets in a useful way.

The last topic in this tutorial will cover straightforward ways to combine data sets. These methods require your data sets to be pre-formatted to fit together, and they are fairly simple to understand.

4.3.4 Binds and Set Operations

4.3.4.1 Non-joins

Join functions specialize in data sets that relate to each other, but are not preformatted to fit together.

Sometimes, however, you may wish to paste together data sets that already “fit together”, as if they were split as is from some master data set. The functions in this topic will show you how.

4.3.4.2 How to combine columns that already align?

Consider the two data sets below. Notice that they contain different variables, but identical observations. For example, the first row in beatles1 aligns with the first row of beatles2, the second row aligns with the second row, and so on.

image

You wouldn’t need to do a join to combine these data sets; you’d just need to paste them together. How could you do it?

4.3.4.3 bind_cols()

If your data sets contain the same observations, in the same order, you can combine them together with bind_cols().

image

Run the code below to see what the results look like in R.

beatles1 %>% bind_cols(beatles2)
## # A tibble: 4 × 4
##   band    name   surname   instrument
##   <chr>   <chr>  <chr>     <chr>     
## 1 Beatles John   Lennon    guitar    
## 2 Beatles Paul   McCartney bass      
## 3 Beatles George Harrison  guitar    
## 4 Beatles Ringo  Starr     drums

Note that this is a dangerous way to store your data, because it is hard to ensure that the rows of one data set haven’t gotten jumbled. bind_cols() cannot tell whether the rows are in the correct order or not, so you will need to be careful in these situations.
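
The sketch below shows how quietly this can go wrong. The way the four columns are split between beatles1 and beatles2 is my assumption, since the original data sets appear only in the diagram. The sketch assumes that dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Assumed split of the columns shown in the output above
beatles1 <- tribble(
  ~band,     ~name,
  "Beatles", "John",
  "Beatles", "Paul",
  "Beatles", "George",
  "Beatles", "Ringo"
)

beatles2 <- tribble(
  ~surname,    ~instrument,
  "Lennon",    "guitar",
  "McCartney", "bass",
  "Harrison",  "guitar",
  "Starr",     "drums"
)

# Rows are paired purely by position
beatles1 %>% bind_cols(beatles2)

# If one data set has been reordered, bind_cols() still runs without complaint,
# but John is now paired with Starr
shuffled <- beatles2 %>% arrange(desc(surname))
mismatched <- beatles1 %>% bind_cols(shuffled)
mismatched
```

If both data sets shared a key column, a join on that column would avoid this risk, because joins match rows by value and not by position.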

4.3.4.4 How to combine rows that align?

These data sets provide the opposite case, which is more common. Here each data set contains the same variables, but different observations. You could think of band2 as a continuation of band.

image

4.3.4.5 bind_rows()

Use bind_rows() to combine data sets that contain the same variables, but different observations.

image

Run the code below to see what the results look like in R.

band %>% bind_rows(band2)
## # A tibble: 6 × 2
##   name   band   
##   <chr>  <chr>  
## 1 Mick   Stones 
## 2 John   Beatles
## 3 Paul   Beatles
## 4 Ringo  Beatles
## 5 Ronnie Stones 
## 6 Mick   Stones

4.3.4.6 .id

When combining data with bind_rows(), it can be useful to add a new column that shows where each row came from.

image

The easiest way to do this is to save the input data sets as a named list and call bind_rows() on the list. Then add the argument .id to your bind_rows() call and set .id to a character string. bind_rows() will use the character string as the name of a new column that displays the name of the data set that each row comes from (as determined by the names in the list).

If you’d like to refresh your understanding of lists in R, revisit the Programming Basics tutorial.

Add a .id argument to the code below to create the output displayed in the diagram.

bands <- list(df1 = band, 
              df2 = band2)

bands %>% bind_rows()
bands <- list(df1 = band, 
              df2 = band2)

bands %>% bind_rows(.id = "origin")
## # A tibble: 6 × 3
##   origin name   band   
##   <chr>  <chr>  <chr>  
## 1 df1    Mick   Stones 
## 2 df1    John   Beatles
## 3 df1    Paul   Beatles
## 4 df2    Ringo  Beatles
## 5 df2    Ronnie Stones 
## 6 df2    Mick   Stones
"Good Job! You can add more than two data sets to your list if you wish to bind together multiple data sets at once."

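As a sketch of the point made in the feedback, the code below binds three data sets at once. band and band2 are copied from the printed output in this topic; band3 is hypothetical and was invented for this illustration. The sketch assumes that dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Copied from the printed output in this topic
band <- tribble(
  ~name,  ~band,
  "Mick", "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

band2 <- tribble(
  ~name,    ~band,
  "Ringo",  "Beatles",
  "Ronnie", "Stones",
  "Mick",   "Stones"
)

# Hypothetical third data set, invented for this illustration
band3 <- tribble(
  ~name,     ~band,
  "George",  "Beatles",
  "Charlie", "Stones"
)

# The names in the list become the values of the new origin column
all_bands <- list(df1 = band, df2 = band2, df3 = band3) %>%
  bind_rows(.id = "origin")
all_bands
```
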
4.3.4.7 Set operations

Did you notice that band and band2 contain a duplicate row? Each contains a row for Mick.

When your data sets contain the same variables and overlapping sets of observations, you can use traditional set operations to return a reduced set of rows drawn from the data sets.

image

Imagine what each of the set operations below will return when applied to the data sets above. Then run the code to check if you are right.

band %>% union(band2)
## # A tibble: 5 × 2
##   name   band   
##   <chr>  <chr>  
## 1 Mick   Stones 
## 2 John   Beatles
## 3 Paul   Beatles
## 4 Ringo  Beatles
## 5 Ronnie Stones
band %>% intersect(band2)
## # A tibble: 1 × 2
##   name  band  
##   <chr> <chr> 
## 1 Mick  Stones
band %>% setdiff(band2)
## # A tibble: 2 × 2
##   name  band   
##   <chr> <chr>  
## 1 John  Beatles
## 2 Paul  Beatles
band2 %>% setdiff(band)
## # A tibble: 2 × 2
##   name   band   
##   <chr>  <chr>  
## 1 Ringo  Beatles
## 2 Ronnie Stones

4.3.4.8 union()

union() returns every row that appears in either data set, but it removes duplicate copies of the rows.

band %>% union(band2)

image

4.3.4.9 intersect()

intersect() returns only the rows that appear in both data sets. It too removes duplicate copies of these rows.

band %>% intersect(band2)

image

4.3.4.10 setdiff()

setdiff() returns all of the rows that appear in the first data set but not the second.

band %>% setdiff(band2)

image
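
To tie the set operations back to the earlier functions, here is a sketch that compares union() with bind_rows() followed by distinct(), and runs setdiff() in both directions. The data sets are copied from the printed output in this topic; the result names are my own. The sketch assumes that dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Copied from the printed output in this topic
band <- tribble(
  ~name,  ~band,
  "Mick", "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)

band2 <- tribble(
  ~name,    ~band,
  "Ringo",  "Beatles",
  "Ronnie", "Stones",
  "Mick",   "Stones"
)

# union() keeps the same rows as stacking the data sets and dropping duplicates
via_union <- band %>% union(band2)
via_bind  <- band %>% bind_rows(band2) %>% distinct()

# setdiff() depends on the order of its arguments
only_in_band  <- band %>% setdiff(band2)
only_in_band2 <- band2 %>% setdiff(band)

via_union
only_in_band
only_in_band2
```

Because setdiff() is not symmetric, it is worth deciding which data set comes first before you call it.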

5 Iterate

Master a core programming paradigm with the purrr package: for each ____ do ____.

5.1 Introduction to Iteration

Iteration is the task of applying a function iteratively to each element in a vector. This tutorial will explain what a vector is (it might not be what you think!) and introduce three ways to do iteration in R: for loops, the lapply functions, and the purrr package.
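
As a small preview, here is a sketch of the same task done in each of the three ways. The list x and the output names are hypothetical, made up for this illustration; the sketch assumes that the purrr package is installed.

```r
library(purrr)

# A list is also a vector in R; each element here is a numeric vector
x <- list(a = 1:3, b = 4:6, c = 7:9)

# 1. A for loop that fills a pre-allocated output vector
out_loop <- vector("double", length(x))
for (i in seq_along(x)) {
  out_loop[i] <- mean(x[[i]])
}

# 2. lapply() from base R, which returns a list
out_lapply <- lapply(x, mean)

# 3. map_dbl() from purrr, which returns a double vector
out_map <- map_dbl(x, mean)

out_loop
out_lapply
out_map
```
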

5.2 Map

purrr’s family of map functions makes iteration quick and easy. Here you will learn the ins and outs of map() and its variants.

5.3 Map Shortcuts

Here you will learn map()’s built-in shortcuts for the most common map tasks, as well as an expression language that lets you map more than functions.

5.4 Multiple Vectors

Now that you know how to iterate over a single vector, it is time to learn how to iterate over two or more vectors at once, or even a vector of functions.

5.5 List Columns

Here you will use your iteration skills to overhaul your entire data analysis workflow (if you want). Iteration facilitates a very useful new way to organize the products of data science.

6 Write Functions

Functions are the key to programming in R. This primer will teach you how to write and use your own reusable functions.

6.1 Function Basics

Start here to learn what a function really is. This quick tutorial explains the structure of functions and how to call them.

6.2 How to Write a Function

Here it is! The best practice workflow for writing your own functions in R. You’ll also learn some shortcuts for converting common types of code into R functions.

6.3 Arguments

Arguments are the user interface (UI) to your functions. This extended quiz will teach you the ins and outs of writing, and using, a good argument UI.

Environments and Scoping Rules

Here you will learn how R stores and looks up objects.

6.4 Control Flow

Control flow refers to the order in which a function executes its code. Here you’ll learn how to run specific code in specific cases with if and else, and how to stop function execution early with return() and stop().
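
As a taste of what that looks like, here is a minimal sketch that uses all four pieces. The function describe_delay() is hypothetical; I wrote it only to illustrate if, else, return(), and stop().

```r
# Hypothetical helper, written only to illustrate the constructs named above
describe_delay <- function(delay) {
  # stop() ends execution with an error when the input is unusable
  if (!is.numeric(delay) || length(delay) != 1) {
    stop("delay must be a single number")
  }

  # return() exits the function early for the missing-value case
  if (is.na(delay)) {
    return("unknown")
  }

  # if and else choose which code runs for the remaining cases
  if (delay > 0) {
    "late"
  } else {
    "on time or early"
  }
}

describe_delay(12)
describe_delay(-3)
describe_delay(NA_real_)
```
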

6.5 Advanced Control Flow

Next it is time to learn how to combine logical tests in if statements, as well as how to write if statements that work with vectors. This is a prerequisite for using if in vectorized functions.

6.6 Loops

Learn to repeat code with R’s repeat, while, and for loops. You’ll also learn to recognize when you should or shouldn’t use loops in R.

7 Report Reproducibly

Learn to report, reproduce, and parameterize your work with the best authoring format for Data Science: R Markdown.

8 Build Interactive Web Apps

Say hello to Shiny, R’s package for building interactive web apps. Learn to turn your analyses into elegant tools to share with others.

9 Organize Your Work

Become an R guru by mastering all of the tools built into the RStudio IDE. Discover best practices for programming, debugging, version control, package building and more.